PCT/US97/01491



METHOD OF CREATING AND SEARCHING A MOLECULAR VIRTUAL LIBRARY USING
VALIDATED MOLECULAR STRUCTURE DESCRIPTORS 



A portion of the disclosure of this patent document contains material which is subject
to copyright protection. The copyright owner has no objection to the facsimile reproduction
by anyone of the patent document or the patent disclosure, as it appears in the U.S. Patent and
Trademark Office, WIPO, or any national patent office patent file or records, but otherwise
reserves all copyright rights whatsoever.
Technical Field 

This invention relates to the field of molecular structure/activity analysis and more
specifically to: 1) a method of validating molecular structural descriptors; 2) a method using
validated molecular descriptors to design an optimally diverse combinatorial screening library;
3) a method of merging libraries derived from different combinatorial chemistries; 4) a method
using validated molecular descriptors of generating a searchable virtual library of molecules
which can be combinatorially derived; 5) methods of searching the virtual library for
combinatorially derived product molecules which meet specified criteria; and 6) methods of
following up and optimizing identified leads. The screening libraries designed by the methods
of this invention are constructed to ensure that an optimal structural diversity of compounds
is represented. The search methods of the invention ensure that the same diversity space is not
oversampled and that compounds can be identified having a high likelihood of possessing the
same structure and/or activity of a lead compound. In particular, the invention describes the
design of libraries of small molecules to be used for pharmacological testing.

Background Art

Statement Of The Problem 

While the present invention is discussed with detailed reference to the search for and 
identification of pharmacologically useful chemical compounds, the invention is applicable to 
any attempt to search for and identify chemical compounds which have some desired physical 
or chemical characteristic(s). The broader teachings of this invention are easily recognized if
a different functional utility or useful property describing other chemical systems is substituted 
below for the term "biological activity". 

Starting with the serendipitous discovery of penicillin by Fleming and the subsequent
directed searches for additional antibiotics by Waksman and Dubos, the field of drug discovery
during the post World War II era has been driven by the belief that nature would provide many
needed drugs if only a careful and diligent search for them was conducted. Consequently,
pharmaceutical companies undertook massive screening programs which tested samples of
natural products (typically isolated from soil or plants) for their biological properties. In a
parallel effort to increase the effectiveness of the discovered "lead" compounds, medicinal
chemists learned to synthesize derivatives and analogs of the compounds. Over the years, as
biochemists identified new enzymes and biological reactions, large scale screening continued
as compounds were tested for biological activity in an ever rapidly expanding number of
biochemical pathways. However, proportionately fewer and fewer lead compounds possessing
a desired therapeutic activity have been discovered. In an attempt to extend the range of
compounds available for testing, during the last few years the search for unique biological
materials has been extended to all corners of the earth including sources from both the tropical
rain forests and the ocean. Despite these and other efforts, it is estimated that discovery and
development of each new drug still takes about 12 years and costs on the order of 350 million
dollars.

Beginning approximately twenty-five years ago, as bioscientists learned more about the
chemical and stereochemical requirements for biological interactions, a variety of
semi-empirical, theoretical, and quantitative approaches to drug design were developed. These
approaches were accelerated by the availability of powerful computers to perform
computational chemistry. It was hoped that the era of "rational drug design" would shorten the
time between significant discoveries and also provide an approach to discovering compounds
active in biological pathways for which no drugs had yet been discovered. In large part, this
work was based on the accumulated observation of medicinal chemists that compounds which
were structurally similar also possessed similar biological activities. While significant strides
were made using this approach, it too, like the mass screening programs, failed to provide a
solution to the problem of rapidly discovering new compounds with activities in the ever
increasing number of biological pathways being elucidated by modern biotechnology.

During the past four or five years, a revised screening approach has been under
development which, it was hoped, would accelerate the pace of drug discovery. In fact, the
approach has been remarkably successful and represents one of the most active areas in
biotechnology today. This new approach utilizes combinatorial libraries against which
biological assays are screened. Combinatorial libraries are collections of molecules generated
by synthetic pathways in which either 1) two groups of reactants are combined to form
products; or 2) one or more positions on core molecules are substituted by a different chemical
constituent/moiety selected from a large number of possible constituents.

Two fundamental ideas underlie combinatorial screening libraries. The first idea,
common to all drug research, is that somewhere amongst the diversity of all possible chemical
structures there exist molecules which have the appropriate shape and binding properties to
interact with any biological system. The second idea is the belief that synthesizing and testing
many molecules in parallel is a more efficient way (in terms of time and cost) to find a
molecule possessing a desired activity than the random testing of compounds, no matter what
their source. In the broadest context, these ideas require that, since the binding requirements
of a ligand to the biological systems under study (enzymes, membranes, receptors, antibodies,
whole cell preparations, genetic materials, etc.) are not known, the screened compounds should
possess as broad a range of characteristics (chemical and physical) as possible in order to
increase the likelihood of finding one that is appropriate for any given biological target. This
requirement for a screening library is reflected in the term "diversity" - essentially a way of
suggesting that the library should contain as great a dissimilarity of compounds as possible.

However, as is immediately apparent, a combinatorial approach to synthesizing
molecules generates an immense number of compounds, many with a high degree of structural
similarity. In fact, the number of compounds synthetically accessible with known organic
reactions exceeds by many orders of magnitude the numbers which can actually be made and
tested. One area where these ideas were first explored is in the design of peptide libraries. For
a library of five member peptides synthesized using the 20 naturally occurring amino acids,
3,200,000 (20^5) different peptides may be constructed. The number of combinatorial
possibilities increases even more dramatically when non-peptide combinatorial libraries are
considered. With non-peptide libraries, the whole synthetic chemical universe of combinatorial
possibilities is available. Library sizes ranging from 5 X 10^ to 4 X 10" molecules are now
being discussed. The enormous universe of chemical compounds is both a blessing and a curse
to medicinal chemists seeking new drugs. On the one hand, if a molecule exists with the
desired biological activity, it should be included in the chemical universe. On the other hand,
it may be impossible to find. Thus, the principal focus of recent efforts has been to define
smaller screening subsets of molecules derivable from accessible combinatorial syntheses
without losing the inherent diversity of an accessible universe.
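
The peptide count quoted above can be checked with a short calculation. The following Python
sketch is illustrative only and is not part of the disclosed method; the function name
count_sequences is a hypothetical name introduced here. It assumes that each position of the
peptide may independently hold any one of the available building blocks.

```python
def count_sequences(n_building_blocks: int, n_positions: int) -> int:
    """Number of distinct ordered sequences when every position may hold
    any one of the building blocks (assumed independent choices)."""
    return n_building_blocks ** n_positions


# Five member peptides built from the 20 naturally occurring amino acids.
pentapeptides = count_sequences(20, 5)
print(f"{pentapeptides:,}")  # prints 3,200,000
```

Because the count grows as a power of the number of positions, each additional position
multiplies the size of the accessible universe by the number of building blocks.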

To date, in order to narrow the focus of the search and reduce the number of
compounds to be screened, attention has been directed to designing biologically specific
libraries. Thus, many combinatorial screening libraries existing in the prior art have been
designed based on prior knowledge about a particular biological system such as a known
pharmacophore (a geometric arrangement of structural fragments abstracted from molecular
structures known to have activity). Even with this knowledge, molecules are included in these
prior art libraries based on intuition - "seat of the pants" estimations of likely similarity based
on an intuitive "feel" for the systems under study. This procedure is essentially pseudo-random
screening, not rational library design. Several biotechnology startup companies have developed
just such proprietary libraries, and success using combinatorial libraries has been achieved by
sheer effort. In one example, 18 libraries containing 43 million compounds were screened to
identify 27 active compounds. With library searches of this magnitude, it is most likely that
the enormous number of inactive molecules [(43 X 10^6) - 27] must have included staggering
numbers of redundantly inactive molecules - molecules not significantly distinguishable from
one another - even in libraries designed with a particular biological target in mind. Clearly,
when searching for a lead molecule which interacts with an uncharacterized biological target,
approaches requiring knowledge of the biological targets will not work. But finding such a lead
is exactly the case for which it is hoped general purpose screening libraries can be designed.
If the promise of combinatorial chemistry is ever to be fully realized, some rational and
quantitative method of reducing the astronomical number of compounds accessible in the
combinatorial chemistry universe to a number which can be usefully tested is required. In other
words, the efficiency of the search process must be increased. For this purpose, a smaller
rationally designed screening library, which still retains the diversity of the combinatorially
accessible compounds, is absolutely necessary.

Thus, there are two criteria which must be met by any screening library subset of some
universe of combinatorially accessible compounds. First, the diversity, the dissimilarity of the
universe of compounds accessible by some combinatorial reaction, must be retained in the
screening subset. A subset which does not contain examples of the total range of diversity in
such a universe would potentially miss critical molecules, thereby frustrating the very reason
for the creation of the subset. Second, for efficient screening, the ideal subset should not
contain more than one compound representative of each aspect of the diversity of the larger
group. If more than one example were included, the same diversity would be tested more than
once. Such redundant screening would yield no new information while simultaneously
increasing the number of compounds which must be synthesized and screened. Therefore, the
fundamental problem is how to reduce to a manageable number the number of compounds that
need to be synthesized and tested while at the same time providing a reasonably high
probability that no possible molecule of biological importance is overlooked. (In this regard,
it should be recognized that the only way of absolutely insuring that all diversity is represented
in a library is to include and test all compounds.) A conceptual analogy to the problem might
be: what kind of filter can be constructed to sort out from the middle of a blinding snowstorm
individual snowflakes which represent all the classes of crystal structures which snowflakes can
form?

The fundamental question plaguing progress in this area has been whether the concept
of the diversity of molecular structure can be usefully described and quantified; that is, how
is it possible to compare/distinguish the physical and chemical properties determinative of
biological activity of one molecule with that of another molecule? Without some way to
quantitatively describe diversity, no meaningful filter can be constructed. Fortunately, for
biological systems, the accumulated wisdom of bioscientists has recognized a general principle
alluded to earlier which provides a handle on this problem. As framed by Johnson and
Maggiora, the principle is simply stated as: "structurally similar molecules are expected to
exhibit similar (biological) properties." Based on this principle, quantifying diversity becomes
a matter of quantifying the notion of structural similarity. Thus, for design of a screening
subset of a combinatorial library (hereafter referred to as a "combinatorial screening library"),
it should only be necessary to identify which molecules are structurally similar and which
structurally dissimilar. According to the selection criteria outlined above, one molecule of each
structurally similar group in the combinatorially accessible chemical universe would be
included in the library subset. Such a library would be an optimally diverse combinatorial
screening library. The problem for medicinal chemists is to determine how the intuitively
perceived notions of structural similarity of chemical compounds can be validly quantified.
Once this question is satisfactorily answered, it should be possible to rationally design
combinatorial screening libraries.
Prior Art Approaches 

Many descriptors of molecular structure have been created in the prior art in an attempt
to quantify structural similarity and/or dissimilarity. As the art has recognized, however, no
method currently exists to distinguish those descriptors that quantify useful aspects of similarity
from those which do not. The importance of being able to validate molecular descriptors has
been a vexing problem restricting advances in the art, and, before this invention, no generally
applicable and satisfactory answer had been found. The problem may be conceptualized in
terms of a multidimensional space of structurally derivable properties which is populated by
all possible combinatorially accessible chemical compounds. Compounds lying "near" one
another in any one dimension may lie "far away" from one another in another dimension. The
difficulty is to find a useful design space - a quantifiable dimensional space (metric space) in
which compounds with similar biological properties cluster; i.e., are found measurably near to
each other. What is desired is a molecular structural descriptor which, when applied to the
molecules of the chemical universe, defines a dimensional space in which the "nearness" of
the molecules with respect to a specified characteristic (i.e., biological activity) in the chemical
universe is preserved in the dimensional space. A molecular structural descriptor (metric)
which does not have this property is useless as a descriptor of molecular diversity. A valid
descriptor is defined as one which has this property.

In light of the above, it should be noted that there is a difference between a descriptor
being valid and being perfect. There may or may not be a "perfect" metric which precisely and
quantitatively maps the diversity of compounds (much less those of biological interest).
However, a good approximation is sufficient for purposes of designing a combinatorial
screening library and is considered valid/useful. Acceptance of this validation/usefulness
criterion is essentially equivalent to saying that, if there is a high probability that, when one
molecule is active (or inactive), a second molecule is also active (or inactive), then most of
the time sampling one of the pair will be sufficient. Restating this same principle with a
slightly different emphasis highlights another feature, namely: the design criteria for
combinatorial screening libraries should yield a high probability that, for any given inactive
molecule, it is more probable to find an active molecule somewhere else rather than as a near
neighbor of that inactive molecule. While this is a probabilistic approach, it emphasizes that
a good approximation to a perfect metric is sufficient for purposes of designing a combinatorial
screening library as well as in other situations where the ability to discriminate molecular
structural difference and similarities is required. A perfect descriptor (certainty) for
pharmacological searching is not needed to achieve the required level of confidence as long
as it is valid (maps a subspace where biological properties cluster).
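
The probabilistic notion of validity described above can be illustrated with a small Python
sketch. It is offered only as an illustration under stated assumptions and is not the validation
method of the invention: the function name neighbor_agreement, the one-dimensional descriptor
values, and the radius and tolerance parameters are all hypothetical choices made here. The
sketch estimates, for pairs of molecules that lie near each other in the descriptor, how often
their measured activities are also close.

```python
from itertools import combinations


def neighbor_agreement(molecules, radius, tolerance):
    """Fraction of descriptor-near pairs whose activities are also similar.

    molecules: list of (descriptor_value, activity) tuples (hypothetical data).
    radius:    maximum descriptor difference for two molecules to count as neighbors.
    tolerance: maximum activity difference for two molecules to count as agreeing.
    Returns None when no pair of molecules lies within the radius.
    """
    near_pairs = [
        (a, b)
        for a, b in combinations(molecules, 2)
        if abs(a[0] - b[0]) <= radius
    ]
    if not near_pairs:
        return None
    agreeing = sum(1 for a, b in near_pairs if abs(a[1] - b[1]) <= tolerance)
    return agreeing / len(near_pairs)


# Made-up data: neighbors in the descriptor have similar activities.
clustered = [(0.0, 1.0), (0.1, 1.1), (5.0, 9.0), (5.1, 9.2)]
# Made-up data: neighbors in the descriptor have unrelated activities.
scattered = [(0.0, 1.0), (0.1, 9.0)]

print(neighbor_agreement(clustered, radius=0.5, tolerance=0.5))
print(neighbor_agreement(scattered, radius=0.5, tolerance=0.5))
```

A fraction near one is what a valid descriptor in the above sense would be expected to show; a
low fraction suggests that nearness in the descriptor says little about activity.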

The typical prior art approach for establishing selection criteria for screening library
subsets relied on the following clustering paradigm: 1) characterization of compounds
according to a chosen descriptor(s) (metric[s]); 2) calculation of similarities or "distances" in
the descriptor (metric) between all pairs of compounds; and 3) grouping or clustering of the
compounds based on the descriptor distances. The idea behind the paradigm is that, within a
cluster, compounds should have similar activities and, therefore, only one or a few compounds
from each cluster, which will be representative of that cluster, need be included in a library.
The actual clustering is done until the prior art user feels comfortable with the groupings and
their spacing. However, with no knowledge of the validity/usefulness of the descriptor
employed, and no guidance with respect to the size or spacing of clusters to be expected from
any given descriptor, prior art clustering has been, at best, another intuitive "seat of the pants"
approach to diversity measurement.
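
The three steps of the clustering paradigm described above can be sketched in Python as
follows. This is a minimal illustration under stated assumptions, not a reconstruction of any
particular prior art program: the Euclidean distance, the simple leader-style grouping, the
radius parameter, and the names pairwise_distances and leader_clusters are all assumptions
introduced here, and the descriptor vectors are made-up values.

```python
import math
from itertools import combinations


def pairwise_distances(descriptors):
    """Step 2: distance in the descriptor between all pairs of compounds."""
    return {
        frozenset((a, b)): math.dist(descriptors[a], descriptors[b])
        for a, b in combinations(descriptors, 2)
    }


def leader_clusters(descriptors, radius):
    """Step 3: group each compound with the first cluster leader within `radius`;
    a compound with no leader that close starts a new cluster."""
    distances = pairwise_distances(descriptors)
    clusters = {}  # leader name -> list of member names
    for name in descriptors:
        for leader in clusters:
            if distances[frozenset((leader, name))] <= radius:
                clusters[leader].append(name)
                break
        else:
            clusters[name] = [name]
    return clusters


# Step 1: compounds characterized by a chosen descriptor (hypothetical 2D values).
descriptors = {
    "A": (0.0, 0.0),
    "B": (0.1, 0.0),
    "C": (5.0, 5.0),
    "D": (5.1, 5.0),
}

clusters = leader_clusters(descriptors, radius=1.0)
representatives = list(clusters)  # one compound taken from each cluster
print(clusters)
print(representatives)
```

The result depends entirely on the chosen radius and on the descriptor itself, which is the
difficulty noted above: nothing in the procedure indicates whether either choice is meaningful.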

The prior art describes the construction and application of many molecular structural
descriptors while all the while tacitly acknowledging that little progress has been made towards
solving the fundamental problem of establishing their validity. The field has nevertheless
proceeded based on the belief/faith that, by incorporating in the descriptors certain measures
which had been recognized in QSAR studies as being important contributors to defining
structure-activity relationships, valid/useful descriptors would be produced. In a leading
method representative of this prior art approach to defining a similarity descriptor, E. Martin
et al. construct a metric for quantifying structural similarity using measures that characterize
lipophilicity, shape and branching, chemical functionality, and receptor recognition features.
(For the reasons set forth later in relation to the present invention, Martin et al. applied their
metric to the reactants which would be used in combinatorial synthesis.) This large set of
measures is used to generate a statistically blended metric consisting of a total of 16 properties
for each individual reactant studied (5 shape descriptors, 5 measures of chemical functionality,
5 receptor binding descriptors, and one lipophilicity property). This generates a 16 dimensional
property space. The 16 properties are simultaneously displayed in a circular "Flower Plots"
graphical environment, where each property is assigned a petal. All the plots together visually
display how the diversity of the studied reactants is distributed through the computed property
space. Martin acknowledges that the plots "...cannot, of course, prove that the subset is
diverse in any 'absolute' sense, independent of the calculated properties." (Martin at 1434)

In another approach relating to peptoid design, Martin et al. have characterized the
varieties of shape that an unknown receptor cavity might assume by a few assemblages of
blocks, called "polyominos". Candidates for a combinatorial design are classified by the types
of polyominos into which they can be made to fit, or "docked". The 7 flexible polyomino
shape descriptors are added to the previously defined 16 descriptors to yield a 23 dimensional
property space. Martin has demonstrated that the docking procedure generates for a
methotrexate ligand in a cavity of dihydrofolate reductase nearly the correct structure as that
established by X-ray diffraction studies. The docking procedure, which must be applied to
every design candidate for each polyomino, requires a considerable amount of CPU time (is
computationally expensive). However, a problem with this approach is the conceptually severe
(unjustified) approximation of representing all possible irregularly shaped receptor cavities by
only about a dozen assemblies of smooth-sided polyomino cubes. Martin has also presented
no validation of the approximation which, in this case, would be a demonstration that molecules
which fit into the same polyominos tend to have similar biological properties.

One approach which has been taken to try to empirically assess the relative validity of
prior art metrics has been to survey the metrics to see if any of them appeared to be superior
to any others as judged by clustering analysis. Y. C. Martin et al. have reported that 3D
fingerprints, collections of fragments defined by pairs of atoms and their accessible interatomic
distances, perform no better than collections of 2D fragments in defining clusters that separate
biologically active from inactive compounds. As will be seen later, some of this work pointed
towards the possible validity of one metric, but the authors concentrated on the comparative
clustering aspects and did not follow up on the broader import of the data.

W. Herndon among others has pointed out that an experimentally determined similarity
QSAR is, by definition, a good test of the validity of that similarity concept for the biological
system from which it is derived and may have some usefulness in estimating diversity for that
system. However, QSARs essentially map only the space of a particular receptor, do not
provide information about the validity of other descriptors, and would be generally inapplicable
to construction of a combinatorial screening library designed for screening unknown receptors
or those for which no QSAR data was available.

Finally, D. Chapman et al. have used their "Compass" 3D-QSAR descriptor, which is
based on the three dimensional shape of molecules, the locations of polar functionalities on the
molecules, and the fixation entropies of the molecules, to estimate the similarity of molecules.
Essentially, using the descriptor, they try to find the molecules which have the maximum
overlap (in geometric/cartesian space) with each other. The shape of each molecule of a series
is allowed to translate and rotate relative to each other molecule and the internal degrees of
freedom are also allowed to rotate in an iterative procedure until the shapes with greatest or
least overlap similarity are identified. Selecting 20 maximally diverse carboxylic acids based
on seeking the maximally diverse alignment of each of the 3000 acids considered took
approximately 4 CPU computing weeks by their method. No indication was given of whether
their descriptor was valid in the sense defined above, and, clearly, such a procedure would be
too time consuming to apply to a truly large combinatorial library design.

One way in which many of the prior art approaches attempt to work around the
problem of not knowing if a molecular structural descriptor is valid is to try, when clustering,
to maximize as much as possible the distance between the clusters from which compounds will
be selected for inclusion in the screening library subset. The thinking behind this approach is
that, if the clusters are far enough apart, only molecules diverse from each other will be
chosen. Conversely, it is thought that, if the clusters are close together, oversampling
(selection of two or more molecules representative of the same elements of diversity) would
likely occur. However, as we have seen, if the metric used in the cluster analysis is not
initially valid (does not define a subspace in which molecules with similar biological activity
cluster), then no amount of manipulation will prevent the sample from being essentially
random. Worse yet, an invalid metric might not yield a selection as good as random! The
acknowledgement by Martin quoted above is a recognition of the prior art's failure to yet
discover a general method for validating descriptors.

Another related problem in the prior art is the failure to have any objective manner of
ascertaining when the library subset under design has an adequate number of members; that
is, when to stop sampling. Clearly, if nothing is known about the distribution of the diversity
of molecules, one arbitrary stopping point is as good as any other. Any stopping point may
or may not sample sufficiently or may oversample. In fact, the prior art has not recognized
a coherent quantitative methodology for determining the end point of selection. Essentially,
in the prior art, a metric is used to maximize the presumed differences between molecules
(typically in a clustering analysis), and a very large number of molecules are chosen for
inclusion in a screening library subset based on the belief that there is safety in numbers; that
sampling more molecules will result in sampling more of the diversity of a combinatorially
accessible chemical space. As pointed out earlier, however, only by including all possible
molecules in a library will one guarantee that all of the diversity has been sampled. Short of
such total sampling, users of prior art library subsets constructed along the lines noted above
do not know whether a random sample, a representative sample, or a highly skewed sample
has been screened.

Several other problems flow from the inability to rationally select a combinatorial
screening library for optimal diversity and these are related both to the chemistry used to
create the combinatorial library and the screening systems used. First, because many more
molecules may have to be synthesized than may be needed, mass synthetic schemes have to
be devised which create many combinations simultaneously. In fact, there is a good deal of
disagreement in the prior art as to whether compounds should be synthesized individually or
collectively or in solution or on solid supports. Within any synthetic scheme, an additional
problem is keeping track of and identifying the combinations created. It should be understood
that, where relatively small (molecular weight of less than about 1500) organic molecules are
concerned, generally standard, well known, organic reactions are used to create the molecules.
In the case of peptide like molecules, standard methods of peptide synthesis are employed.
Similarly for polysaccharides and other polymers, reaction schemes exist in the prior art which
are well known and can be utilized. While the synthesis of any individual combinatorial
molecule may be straightforward, much time and effort has been and is still being expended
to develop synthetic schemes in which hundreds, thousands, or tens of thousands of
combinatorial combinations can be synthesized simultaneously.

In many synthetic schemes, mixtures of combinatorial products are synthesized for
screening in which the identity of each individual component is uncertain. Alternatively, many
different combinatorial products may be mixed together for simultaneous screening. Each
additional molecule added to a simultaneous screen means that many fewer individual screening
operations have to be performed. Thus, it is not unusual that a single assay may be
simultaneously tested against up to 625 or more different molecules. Not until the mixture
shows some activity in the biological screening assay will an attempt be made to identify the
components. Many approaches in the prior art therefore face "deconvolution" problems; i.e.,
trying to figure out what was in an active mixture either by following the synthetic reaction
pathway, by resynthesizing the individual molecules which should have resulted from the
reaction pathway, or by direct analysis of duplicate samples. Some approaches even tag the
carrier of each different molecule with a unique molecular identifier which can be read when
necessary. All these problems are significantly decreased by designing a library for optimal
diversity.

Another major problem with the inclusion of multiple and potentially non-diverse
compounds in the same screening mixture is that many assays will yield false positives (have
an activity detected above a certain established threshold) due to the combined effect of all the
molecules in the screening mixture. The absence of the desired activity is only determined after
expending the time, effort, and expense of identifying the molecules present in the mixture and
testing them individually. Such instances of combined reactivity are reduced when the
screening mixture can be selected from molecules belonging to diverse groups of an optimally
designed library, since it is not as likely that molecules of different (diversity) structures would
produce a combined effect.

It is clear that a great deal of cleverness has been expended in actually manufacturing
the combinatorial libraries. While the basic chemistry of synthesizing any given molecule is
straightforward, the next advance in the development of combinatorial chemistry screening
libraries will be optimization of the design of the libraries.

Further problems in the prior art arise in the attempt to follow up leads resulting from
the screening process. As noted above, many libraries are designed with some knowledge of
the receptor and its binding requirements. While, within those constraints, all possible
combinatorial molecules are synthesized for screening, finding a few molecules with the
desired activity among such a library yields no information about what active molecules might
exist in the universe accessible with the same combinatorial chemistry but outside the limited
(receptor) library definition. This is an especially troubling problem since, from serendipitous
experience, it is well known that sometimes totally unexpected molecules with little or no
obvious similarity to known active molecules exhibit significant activity in some biological
systems. Thus, even finding a candidate lead in a library whose design was based on
knowledge of the receptor is no guarantee that the lead can be followed to an optimal
compound. Only a rationally designed combinatorial screening library of optimal diversity can
approach this goal.

For prior art library subsets designed around the use of some descriptor to cluster
compounds, similar problems may exist. In such a library design, one or at most a few
compounds will have been selected from each cluster. Only if the descriptor is valid does such
a selection procedure make sense. If the descriptor is not valid, each cluster will contain
molecules representative of many different diversities and selecting from each cluster will still
have resulted in a random set of molecules which do not sample all of the diversity present.
Since the prior art does not possess a generally applicable method of validating descriptors,
all screening performed with prior art libraries is suspect and may not have yielded all the
useful information desired about the larger chemical universe from which the library subsets
were selected.

Finally, as the expense in time and effort of creating and screening combinatorial
libraries increases, the question of the uniqueness of the libraries becomes ever more critical.
Questions can be asked such as: 1) does library "one" cover the same diversity of chemical
structures as library "two"; 2) if libraries "one" and "two" cover both different and identical
aspects of diversity, how much overlap is there; 3) what about the possible overlap with
libraries "three", "four", "five", etc.? To date, the prior art has been unable to answer these
questions. In fact, assumptions have been made that as long as different chemistries were
involved (i.e., proteins, polysaccharides, small organic molecules), it was unlikely that the
same diversity space was being sampled. However, such an assumption contradicts the well
known reality that biological receptors can recognize molecular similarities arising from
different structures. When screening for compounds possessing activity for undefined biological
receptors, there is no way of telling a priori which chemistry or chemistries is most likely to
produce molecules with activity for that receptor. Thus, screening with as many chemistries
as possible is desired but is only really practical if redundant sampling of the same diversity
space in each chemistry can be avoided. The prior art has not provided any guidance towards
the resolution of these problems.

Brief Summary Of The Invention

In order to select a screening subset of a combinatorially accessible chemical universe
which is representative of all the structural variation (diversity) to be found in the universe,
it is necessary to have the means to describe and compare the molecular structural diversity
in the universe. The first aspect of the present invention is the discovery of a generalized
method of validating descriptors of molecular structural diversity. The method does not assume
any prior knowledge of either the nature of the descriptor or of the biological system being
studied and is generally applicable to all types of descriptors of molecular structure. This
discovery enables several related advances to the art.

The second aspect of the invention is the discovery of a method of generating a
validated three dimensional molecular structural descriptor using CoMFA fields. Generating
these field descriptors required solving the alignment problem associated with these
measurements. The alignment problem was solved using a topomeric procedure.

A third aspect of the invention is the discovery that validated molecular structural
descriptors applicable to whole molecules can be used both to: 1) quantitatively define a
meaningful end-point for selection in defining a single screening library (sampling procedure);
and 2) merge libraries so as not to include molecules of the same or similar diversity. It is
shown that a known metric (Tanimoto 2D fingerprint similarity) can be used in conjunction
with the sampling procedure for this purpose.

A fourth aspect of the invention is the discovery of a method of using validated reactant
and whole molecule molecular structural descriptors to rationally design a combinatorial
screening library of optimal diversity. In particular, the shape sensitive topomeric CoMFA
descriptor and the atom group Tanimoto 2D similarity descriptor may be used in the library
design. As a benefit of designing a combinatorial screening library of optimal diversity based
on validated molecular descriptors, many prior art problems associated with the synthesis,
identification, and screening of mixtures of combinatorial molecules can be reduced or
eliminated.

A fifth aspect of the invention is the use of validated molecular structural descriptors
to guide the search for optimally active compounds after a lead compound has been identified
by screening. In the case of a screening library designed for optimal diversity using validated
descriptors, a great deal of the information necessary for lead optimization flows directly from
the library design. In the case where a lead has been identified by screening a prior art library
or through some other means, validated descriptors provide a method for identifying the
molecular structural space nearest the lead which is most likely to contain compounds with the
same or similar activity.

A sixth aspect of this invention is the discovery of a method for generating, using
validated molecular descriptors, a virtual library of product molecules derivable from
combinatorial reactions (or which may be represented by a combinatorial SLN [CSLN]) in
which the characteristics of product molecules can be searched and compared without the
actual construction of the product molecules. This virtual library allows the searching of
billions of possible product molecules in reasonable amounts of time.

A seventh aspect of this invention is the discovery that, using validated molecular
descriptors, the virtual library can be searched over billions of possible product molecules in
ways both to yield optimally diverse screening libraries and to follow up on lead explosions.
Using the virtual library, a much larger fraction of the chemically accessible universe can be
searched for molecules of interest.

An eighth aspect of this invention is the discovery of a way to search, using validated
molecular descriptors, the virtual library for possible molecules which have similar structures
and/or activities to a query molecule which is not necessarily derived from a combinatorial
synthesis. This discovery opens up a whole new method for seeking molecules with similar
characteristics to a previously identified molecule.

It is an object of this invention to define a general process which may be used with
randomly selected literature data sets to validate molecular structural descriptors.

It is a further object of this invention to define a process to derive CoMFA steric fields
(and, if desired, additional relevant fields) using topomeric alignment so that the resulting
descriptor is valid.

It is a further object of this invention to teach that topomeric alignments may be used
to describe molecular conformations.

It is a further object of this invention to define a general process for using a validated
molecular descriptor to establish a meaningful end-point for the sampling of compounds,
thereby avoiding the oversampling of compounds representing the same molecular structural
characteristics.

It is yet a further object of this invention to design an optimally diverse combinatorial
screening library using multiple validated molecular structural descriptors.

It is a further object of this invention to use the topomeric CoMFA molecular structural
descriptor as a reactant descriptor in the design of an optimally diverse combinatorial screening
library.

It is a further object of this invention to use the Tanimoto 2D similarity molecular 
structural descriptor as a product descriptor in the design of an optimally diverse combinatorial 
screening library. 

It is a further object of this invention to define a method for merging assemblies of
molecules (libraries), both those designed by the methods of this invention and others not
designed by the methods of this invention, in such a manner that molecules representing the
same or similar diversity space are not likely to be included.

It is a further object of this invention to define methods for the use of validated
molecular structural descriptors to guide the search for optimally active compounds after a lead
compound has been identified by screening or some other method.

It is a further object of this invention to generate a virtual library, using validated
molecular descriptors, of potential product molecules derivable from combinatorial reactions
(or which may be represented by a combinatorial SLN [CSLN]) which can be searched for
molecules having desired characteristics.

It is a further object of this invention to define methods for creating optimal diversity
screening libraries as subsets of the virtual library.

It is still a further object of this invention to locate within the virtual library possible
product molecules similar in structure and/or activity to lead compounds.

These and further objects of the invention will become apparent from the detailed
description of the invention which follows.

Brief Description of Drawings 

Figure 1 schematically shows the distribution of molecular structures around and about
an island of biological activity in a hypothetical two dimensional metric space for a poorly
designed prior art library and for an efficiently designed optimally diverse screening library.

Figure 2 shows a theoretical scatter plot (Patterson plot) for a metric having the
neighborhood property in which the X axis shows distances in some metric space calculated
as the absolute value of the pairwise differences in some candidate molecular descriptor and
the Y axis shows the absolute value of the pairwise differences in biological activity.

Figure 3 shows a Patterson plot for an illustrative data set.

Figure 4 shows a Patterson plot for the same data set as in Figure 3 but where the
diversity descriptor values (X axis) associated with each molecule have been replaced by
random numbers.

Figure 5 shows a Patterson plot for the same data set as in Figure 3 but where the
diversity descriptor values (X axis) associated with each molecule have been replaced by a
normalized force field strain energy/atom value.

Figure 6 shows three molecular structures numbered and marked in accordance with
the topomeric alignment rule.

Figure 7 is a complete set of Patterson plots for the twenty data sets used for the
validation studies of the topomeric CoMFA descriptor.

Figure 8 shows the two scatter plots displaying the relation between X* values and their
corresponding density ratio values for the tested metrics over the twenty random data sets.

Figure 9 shows the graphs of the Tanimoto similarity measure vs. the pairwise
frequency of active molecules for 18 groups examined from Index Chemicus.

Figure 10 shows a Patterson plot of the Cristalli data set using only those values which
would have been used for a Tanimoto sigmoid plot of the same data set alongside a Patterson
plot of the complete data set.

Figure 11 is a schematic of the combinatorial screening library design process.

Figure 12 shows a comparison of the volumes of space occupied by different molecules
which are determined to be similar according to the Tanimoto 2D fingerprint descriptor but
which are determined to be dissimilar according to the topomeric CoMFA field descriptor.

Figure 13 shows a plot of the Tanimoto 2D pairwise similarities for a typical
combinatorial product universe.

Figure 14 shows the distribution of molecules resulting from a combinatorial screening
library design plotted according to their Tanimoto 2D pairwise similarity after reactant filtering
and after final product selection.

Figure 15 shows the distribution of molecules plotted according to their Tanimoto 2D
pairwise similarity of three database libraries (Chapman & Hall) from the prior art.

Figure 16 shows a schematic representation of sets of possible reactants attached to a
central core.

Figure 17 is a flowchart summarizing the overall process of virtual library construction.

Figures 18, 19, and 20 are a flowchart summarizing the overall process of applying the
Tanimoto fingerprint metric for use in the virtual library.

Figures 21, 22, and 23 are a flowchart summarizing the overall process of using the
Tanimoto fingerprint metric to search for molecules.

Figures 24, 25, and 26 are a flowchart summarizing the overall process of using both
the topomeric CoMFA and Tanimoto metrics to search for molecules in the virtual library.

Figures 27, 28, 29, and 30 are a flowchart summarizing the overall process for
topomeric searches of arbitrary query molecules.

Figure 31 shows the topomeric conformations of Tagamet and Zantac.

Disclosure Of Invention

1. Computational Chemistry Environment
2. Definitions
3. Validating Metrics
   A. Theoretical Considerations - Neighborhood Property
   B. Construction, Application, and Analysis Of Patterson Plots
4. Topomeric CoMFA Descriptor
   A. Topomeric Alignment
      i. General Topomeric Alignment
      ii. Specialized Alignment for Chiral and Equivalent Atoms
   B. Calculation Of CoMFA and Hydrogen Bonding Fields
   C. Validation Of Topomeric CoMFA Descriptor
5. Tanimoto Fingerprint Descriptor
   A. Neighborhood Property
   B. Applicability Of Tanimoto To Different Biological Systems
   C. Comparison of Sigmoid and Patterson Plots
6. Comparison of Tanimoto and Topomeric CoMFA Metrics
7. Additional Validation Results
8. Combinatorial Library Design Utilizing Validated Metrics
   A. Removal Of Reactants For Non-Diversity Reasons
      i. General Removal Criteria
      ii. Biologically Based Criteria
   B. Removal of Non-Diverse Reactants
   C. Identification (Building) Of Products
   D. Removal Of Products For Non-Diversity Reasons
   E. Removal of Non-Diverse Products
9. Lead Compound Optimization
   A. Advantages Resulting From Product Filter
   B. Advantages Resulting From Reactant Filter
   C. Additional Optimization Methods Using Validated Metrics
10. Merging Libraries
11. Other Advantages of Optimally Diverse Libraries
12. Virtual Library Construction & Searching
   A. Derivation of the Database (Virtual Library) of Compounds
   B. Overview of Methodology
   C. Overview of Virtual Library Construction
   D. Virtual Library Construction
      i. Representation of the Database of Compounds
      ii. Application of A First Metric (Topomeric CoMFA)
      iii. Application of A Second Metric (Tanimoto Fingerprint)
      iv. Summary of Method & Scope of Chemistry
   E. Searching the Virtual Library
      i. Example Search Routine of Virtual Library - Tanimoto Similarity
      ii. Design Screening Libraries (Subsets of the Virtual Library)
         (a) Subset Screening Library Based On Topomeric Fields and Tanimoto
         (b) Subset Based on Tanimoto Similarity
         (c) Subset Based on Topomeric Fields
         (d) Subset Based on Combined Metric
      iii. Designing Lead Optimizations
         (a) Search Based on Tanimoto Similarity
         (b) Searches Based on Topomer Similarity
         (c) Topomeric (3D) Searching of Arbitrary Molecular Structures
         (d) Topomeric (3D) Searching of Core Structures



1. Computational Chemistry Environment

Generally, all calculations and analyses to conduct combinatorial chemistry screening
library design and follow up are implemented in a modern computational chemistry
environment using software designed to handle molecular structures and associated properties
and operations. For purposes of this Application, such an environment is specifically
referenced. In particular, the computational environment and capabilities of the SYBYL and
UNITY software programs developed and marketed by Tripos, Inc. (St. Louis, Missouri) are
specifically utilized. Unless otherwise noted, all software references and commands in the
following text are references to functionalities contained in the SYBYL and UNITY software
programs. Where a required functionality is not available in SYBYL or UNITY, the software
code to implement that functionality is provided in an Appendix to this Application. Software
with similar functionalities to SYBYL and UNITY is available from other sources, both
commercial and non-commercial, well known to those in the art. A general purpose
programmable digital computer with ample amounts of memory and hard disk storage is
required for the implementation of this invention. In performing the methods of this invention,
representations of thousands of molecules and molecular structures as well as other data may
need to be stored simultaneously in the random access memory of the computer or in rapidly
available permanent storage. The inventors use a Silicon Graphics, Inc. Challenge-M computer
having a single 150 MHz R4400 processor with 128 Mb memory and 4 Gb hard disk storage
space. As the size of the virtual library increases, a corresponding increase in hard disk storage
and computational power is required. For these tasks, access to several gigabytes of storage
and Silicon Graphics, Inc. processors in the R4400 to R10000 range are useful.

2. Definitions:

The words or phrases in capital letters shall, for the purposes of this application, have
the meanings set forth below:

2D MEASURES shall mean a molecular representation which does not include any
terms which specifically incorporate information about the three dimensional features of the
molecule. 2D is a misnomer used in the art and does not mean a geometric "two dimensional"
descriptor such as a flat image on a piece of paper. Rather, 2D descriptors take no account of
geometric features of a molecule but instead reflect only the properties which are derivable
from its topology; that is, the network of atoms connected by bonds.

2D FINGERPRINTS shall mean a 2D molecular measure in which a bit in a data string
is set corresponding to the occurrence of a given 2-7 atom fragment in that molecule.
Typically, strings of roughly 900 to 2400 bits are used. A particular bit may be set by many
different fragments.

COMBINATORIAL SCREENING LIBRARY shall mean a subset of molecules selected
from a combinatorially accessible universe of molecules to be used for screening in an assay.

MOLECULAR STRUCTURAL DESCRIPTOR shall mean a quantitative representation
of the physical and chemical properties determinative of the activity of a molecule. The term
METRIC is synonymous with MOLECULAR STRUCTURAL DESCRIPTOR and is used
interchangeably throughout this Application.

PATTERSON PLOTS shall mean two dimensional scatter plots in which the distance
between molecules in some metric is plotted on the X axis and the absolute difference in some
biological activity for the same molecules is plotted on the Y axis.

SIGMOID PLOTS shall mean two dimensional plots for which the proportion of
molecular pairs in which the second molecule is also active is plotted on the Y axis and the
pairwise Tanimoto similarity is plotted in intervals on the X axis.

TOPOMERIC ALIGNMENT shall mean conformer alignment based on a set of
alignment rules.
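
By way of illustration only, the 2D FINGERPRINTS definition above can be sketched in Python. The sketch is not the UNITY implementation: the fragment strings, the 1024-bit length, the CRC-32 hash used to pick a bit, and the function names `fingerprint` and `tanimoto` are all assumptions introduced here. Enumerating the 2-7 atom fragments of a molecule is not shown, and the similarity function uses the standard Tanimoto coefficient (bits in common divided by bits set in either fingerprint).

```python
import zlib


def fingerprint(fragments, n_bits=1024):
    """Fold fragment strings into an n_bits-wide bit string held in an int.

    Each fragment sets exactly one bit; because the hash is reduced modulo
    n_bits, a particular bit may be set by many different fragments.
    The hash choice (CRC-32) is an assumption made for this sketch.
    """
    bits = 0
    for frag in fragments:
        bits |= 1 << (zlib.crc32(frag.encode("utf-8")) % n_bits)
    return bits


def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient: bits set in both / bits set in either."""
    union = bin(fp_a | fp_b).count("1")
    if union == 0:
        return 1.0  # convention chosen here for two empty fingerprints
    return bin(fp_a & fp_b).count("1") / union


# Hypothetical fragment lists for two molecules (illustrative only).
fp_1 = fingerprint(["C-C", "C-O", "C-C-N"])
fp_2 = fingerprint(["C-C", "C-O", "C-C-S"])
similarity = tanimoto(fp_1, fp_2)
```

Holding the bit string in a single Python integer keeps the sketch short; a production system would more likely use a packed bit array.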

3. Validating Metrics

A. Theoretical Considerations - Neighborhood Property

As noted above, the similarity principle suggests a way to quantify the concept of
diversity by quantifying structural similarity. While the prior art devised many structural
descriptors, no one has been able to explicitly show that any of the descriptors are valid. It is
possible with the method of this invention to determine the validity of any metric by applying
it to presently existing literature data sets, for which values of biological activity and molecular
structure are known. Once the validity has been determined, the metric may be used with
confidence in designing combinatorial screening libraries and in following up on discovered
leads. Examples of these applications will be given below.

The present invention is the first to recognize that the similarity principle provides
a way to validate metrics. Specifically, the similarity principle requires that any valid
descriptor must have a "neighborhood property". That is: the descriptor must meet the
similarity principle's constraint that it measure the chemical universe in such a way that similar
structures (as defined by the descriptor) have substantially similar biological properties. Or
stated slightly differently: within some radius in descriptor space of any given molecule
possessing some biological property, there should be a high probability that other molecules
found within that radius will also have the same biological property. If a descriptor does not
have the neighborhood property, it does not meet the similarity principle, and cannot be valid.
Regardless of the computations involved or the intentions of the users, using prior art
descriptors without the neighborhood property results, at best, in random selection of
compounds to include in screening libraries.
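
The "radius" statement of the neighborhood property can be illustrated with a short Python sketch. This is not the validation procedure of this Application (that procedure uses Patterson plots, described below); it only makes the wording concrete. A scalar descriptor, a yes/no activity flag, the function name `neighborhood_hit_rate`, and the sample values are all assumptions made for the illustration.

```python
def neighborhood_hit_rate(descriptor, active, radius):
    """Fraction of neighbors of active molecules that are themselves active.

    For every active molecule, each other molecule whose descriptor value
    lies within `radius` counts as a neighbor. Returns None when no active
    molecule has any neighbor.
    """
    hits = 0
    total = 0
    for i, (d_i, a_i) in enumerate(zip(descriptor, active)):
        if not a_i:
            continue  # only neighborhoods around active molecules are examined
        for j, (d_j, a_j) in enumerate(zip(descriptor, active)):
            if i != j and abs(d_i - d_j) <= radius:
                total += 1
                hits += bool(a_j)
    return hits / total if total else None


# Hypothetical descriptor values and activity flags for five molecules.
descriptor = [0.0, 0.2, 0.3, 2.0, 2.1]
active = [True, True, True, False, True]
rate = neighborhood_hit_rate(descriptor, active, radius=0.5)
```

A descriptor with the neighborhood property would be expected to give a high rate at small radii; a rate no better than the overall fraction of actives would suggest essentially random selection.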

The importance of the neighborhood property to the design of combinatorial screening
libraries is schematically illustrated in Figure 1. Figure 1A and Figure 1B show an "island"
1 of biological activity plotted in some relevant two dimensional molecular descriptor space.
In Figure 1A the molecules 2 of a typical prior art library are plotted as hexagons. Around
each hexagon a circle 3 describes the area of the metric space (the neighborhood) in which
molecules of similar structural diversity to the plotted molecule would be found. Since the
prior art metric used to select these molecules was not valid, the molecules are essentially
distributed at random in the metric space. The circles 3 (neighborhoods) of similar structural
diversity of several of the molecules overlap at 4, indicating that they sample the same diversity
space. Clearly, there is no guarantee that the island area will be adequately sampled or that
a great deal of redundant testing will not be involved with such a library design.

In Figure 1B the molecules 5 of an optimally designed library are plotted as stars along
with their corresponding circles 3 of similar structural diversity. Since a valid molecular
descriptor with the neighborhood property was used to select the molecules, molecules were
identified which not only sampled that part of the descriptor space accessible with the
molecular structures available but also did not sample the same descriptor space more than
once. Clearly, the likelihood of sampling the "island" 1 is greater when it is possible to
identify the unique neighborhood 3 around each sample molecule and choose molecules that
sample different areas. Figure 1B represents an optimally diverse design.

A method to quantitatively analyze whether any given metric obeys the neighborhood
principle has been discovered. In the prior art, absolute values of biological activity have
always been considered the dependent variable with the structural metric as the independent
variable. This is the case for traditional QSARs (quantitative structure activity relationships).
Note, however, that the similarity principle requires that for any pair of molecules, differences
in activity are related to differences in structure. In particular, small differences in structure
should be associated with small differences in activity. However, the converse is not
necessarily true; large differences in activity are not necessarily associated with large
differences in structure. The first novel feature of the present invention is that it uses
differences in both measures: biological differences and structural (metric) differences. There
is no rationale present in the prior art suggesting that the use of both differences in such a
manner would be useful. Thus, instead of looking at the values assigned by the metric to each
molecule, the absolute differences in the metric values for each pair of molecules are the
independent variables and the absolute differences in biological activity for each pair of
molecules are the dependent variables. The absolute value is used since it is the difference, not
its sign, which is important.

For a metric possessing the neighborhood property, a scatter plot of pairwise absolute
differences in descriptors for each set of molecules versus pairwise absolute differences in
biological activity for the same set of molecules (Patterson plot) will have a characteristic
appearance as shown in Figure 2. Note that it is important that pairwise absolute differences
for all molecules in a data set are used; that is, the absolute metric "distance" between every
molecule and every other molecule is plotted. Accordingly, there are n(n-1)/2 pairwise
comparisons for every data set containing n compounds. The use of pairwise differences for
every possible pair reflects all the relationships between all structural changes with all activity
changes for the molecules under study.
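
As a minimal sketch of the pairwise construction just described, the following Python fragment builds the points of a Patterson plot. It assumes a scalar descriptor value per molecule (a field descriptor would need its own distance function); the function name `patterson_points` and the sample values are hypothetical and are not taken from any data set in this Application.

```python
from itertools import combinations


def patterson_points(descriptor, activity):
    """Return (|d_i - d_j|, |a_i - a_j|) for every unordered pair (i, j).

    For n molecules this yields n(n-1)/2 points: X is the absolute
    descriptor difference, Y is the absolute activity difference.
    """
    if len(descriptor) != len(activity):
        raise ValueError("descriptor and activity must have the same length")
    return [
        (abs(descriptor[i] - descriptor[j]), abs(activity[i] - activity[j]))
        for i, j in combinations(range(len(descriptor)), 2)
    ]


# Hypothetical values for four molecules (illustrative only).
descriptor = [0.0, 1.0, 3.0, 6.0]
activity = [5.0, 5.5, 7.0, 4.0]
points = patterson_points(descriptor, activity)  # 4 * 3 / 2 = 6 points
```

Note that the function needs only the two lists of values; neither the nature of the descriptor nor that of the biological system enters the construction.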

Line 1 on the graph of Figure 2 depicts a special case where there is a strictly linear
relationship between differences in metric distance and differences in biological activity.
However, the neighborhood property does not imply a linear correlation (corresponding to
points lying on a straight line) and need not imply anything about large property differences
causing large biological activity differences. (Generally, the line should be linear for only very
small changes in molecular structure and would exhibit a complex shape overall depending on
the nature of the biological interaction. However, for purposes of discussion and analysis, it
is useful to employ a straight line as a first approximation.) The slope of line 1 will vary
depending on the biological activity of the measured system. Thus, the lower right trapezoid
(LRT) {defined by the vertices [0,0], [actual metric value, max. bio. value], [max. metric
value, max. bio. value], and [max. metric value, 0]} of the plot may be populated as shown
in any number of ways.

The upper left triangle (ULT) of the plot (above the line) should not be populated at
all as long as the descriptor completely characterizes the compound and there are no
discontinuities in the behavior of the molecules. However, in the real world, some population
of the space (as indicated by points 2) above the line would be expected since there are known
discontinuities in the behavior of real molecular ligands. For instance, it is well known
amongst medicinal chemists that adding one methyl group can cause some very active
compounds to lose all sign of activity.

Figure 3 shows a Patterson plot of a real world example. Points lying above the solid
line near the Y axis reflect a metric space where a small difference in metric property
(structure) produces a large difference in biological property. These points clearly violate the
similarity principle/neighborhood rule. Thus, in the real world sometimes relatively small
differences in structure can produce large differences in activity. If some points lie above the
line, this metric is less ideal, but clearly still useful. The major criterion, and the key point to
recognize, is that for a metric to be valid the upper left triangle will be substantially less
populated than the lower right trapezoid.

Thus, it should be recognized that for any receptor, the presence of some particular side
group or combination of side groups may produce a discontinuity in the receptor response.
Generally, however, any (metric) descriptor displaying the above characteristic of
predominantly populating the lower right trapezoid (such as in Figure 3) will possess the
neighborhood property, and the demonstration that a metric possesses such behavior indicates
the validity/usefulness of that metric. Conversely, a descriptor in which the points in the
difference plot are uniformly distributed (equal density of points in ULT and LRT) does not
obey the neighborhood principle and is invalid as a metric. While a brief glance at the
difference plots may quickly indicate validity or non-validity, visual analysis may be
misleading. As it turns out, data points in the plot frequently overlap so that visually only one
point is seen where there may be two (or more). A quantitative analysis of the data
distribution, therefore, yields a more accurate picture. An objective validation procedure for
determining the validity/usefulness of metrics from Patterson plots of real world data, including
a method for assessing its statistical significance, is set forth below.

Viewing the metric data in this way requires no knowledge about either the actual value
of the biological activities or the actual values assigned by the descriptor under review.
Because all pairwise differences are displayed, all possible gradations of molecular structural
diversity and activity are represented and utilized. Consequently, there is no arbitrary lower
limit set on the usable data.

B. Construction, Application, and Analysis Of Patterson Plots

For purposes of objectively examining metrics for validity, it is first necessary to
accurately determine the slope (placement) of the line which divides a Patterson plot into the
two areas: a lower right trapezoid (LRT) and an upper left triangle (ULT). The triangle is
defined by the points [0, 0], [actual metric value, max. bio. value], and [0, max. bio. value].
The trapezoid is defined by the points [0, 0], [actual metric value, max. bio. value], [max.
metric value, max. bio. value], and [max. metric value, 0]. For a metric to be a valid and
useful measure of molecular diversity, the density of points in the lower right trapezoid should
be significantly greater than the density in the upper left triangle. To determine the correct
placement of the line, the variation in the density of points is used. The line must always pass
through (0,0) at the lower left corner of a Patterson plot since no change in any metric must
imply no change in the biological activity. As noted earlier, considering a straight line is only
a first approximation. A "perfect" metric, which totally describes the structure activity
relationship of the biological system, would display a complex line reflecting the biological
interaction. As a first approximation, a "useful" straight line can be found which meaningfully
reflects the variation in the density of points.

The preferred search for the correct/useful line tests only those slopes which a
particular data set can distinguish; specifically those drawn from [0,0] to each point [actual
metric value, max. bio. value]. The process starts by drawing the line to a point having the
smallest actual metric value [smallest metric value, max. bio. value] and continues for all of
the values observed for actual metric value up to the largest [largest metric value, max. bio.
value]; i.e., subsequent lines are of decreasing slope. (In the limiting case of drawing the line
to [largest metric value, max. bio. value] the trapezoid becomes a triangle.) When searching
for the correct diagonal, it is defined to be the one which yields the highest density (number
of data points/unit graph area) for a lower right triangle, which for this process is defined to
have its vertices at [0, 0], [actual metric value, 0], and [actual metric value, max. bio. value].
Thus, the line is identified based on the density of points under this triangle, but the evaluation
ratios for the metric are calculated based on the density within the trapezoid compared to the
density of the entire plot (sum of triangle and trapezoid areas). The software necessary to
implement this procedure (as well as to determine the values to be discussed below) is
contained in Appendix "A". There may be other procedures for determining the placement of
the line since the line is only a first approximation. Any such procedure must meet two tests:
1) it must consistently distinguish between diversity descriptors; and 2) it must clearly
distinguish/recognize meaningless diversity descriptors. The procedure described here clearly
meets both tests. (The preferred search for the placement of the line is as described above.
However, the lines shown in the Figures accompanying this description were found slightly
differently. For the Figures, the search was started by requiring that the diagonal also pass
through the point defined by the largest descriptor difference and the maximum biological
activity difference [max. metric value, max. bio. value]. The line was then systematically tilted
towards the vertical trying each of 100 evenly spaced steps (in terms of the Y/X ratio). As in
the preferred method, the line yielding the highest density for the LRT was drawn. The line
placements yielded by the two methods are not substantially different. All numerical values
reported in this specification were obtained from Patterson plots in which the preferred line
drawing process was used.)
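
The preferred line search described above can be sketched in Python as follows. This is a minimal illustration and not the Appendix "A" software: the function names `patterson_points` and `find_line` are hypothetical, the maximum biological difference is assumed to be greater than zero, points lying exactly on a candidate line are counted as under it, and ties between candidate lines are resolved in favor of the steeper line. None of these choices is specified in the text.

```python
from itertools import combinations

def patterson_points(metric, activity):
    """Return (|metric difference|, |activity difference|) for every pair of molecules."""
    return [(abs(metric[i] - metric[j]), abs(activity[i] - activity[j]))
            for i, j in combinations(range(len(metric)), 2)]

def find_line(points):
    """Return the metric value x for which the line from [0, 0] to [x, max bio]
    gives the densest lower right triangle [0, 0], [x, 0], [x, max bio]."""
    max_bio = max(y for _, y in points)
    best_x, best_density = None, -1.0
    # Only slopes the data can distinguish: one candidate per observed metric value.
    for x_i in sorted({x for x, _ in points if x > 0}):
        # A point is inside the triangle if x <= x_i and it lies on or under the
        # line y = (max_bio / x_i) * x (written cross-multiplied to avoid division).
        count = sum(1 for x, y in points if x <= x_i and y * x_i <= max_bio * x)
        density = count / (0.5 * x_i * max_bio)
        if density > best_density:  # assumption: a tie keeps the steeper line
            best_x, best_density = x_i, density
    return best_x
```

Because candidates are visited from the smallest metric value upward, successive lines have decreasing slope, as in the description.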

The Patterson plot showing the diagonal for an exemplary data set used to validate the
topomeric CoMFA descriptor (discussed in Section 4.C. below) is shown in Figure 3. For
comparison, Figures 4 and 5 show Patterson plots for two other variations of the same data
which would not be expected to be valid molecular "measurements" useful as diversity metrics.
For Figure 4, in place of the actual metric values of Figure 3, random numbers were generated
for the diversity descriptor values of each compound and the Patterson plot generated from the
differences in these random numbers. As expected from a random number assignment, no line
can be found by the procedure which enriches the density in the triangle and the best ratio is
not significantly different from 1.0. The best line is always reported by the procedure, which
in this case corresponds to a nearly vertical line drawn to the point [minimum metric value,
max. bio. value]. For randomly distributed values, this line yields the highest density for the
test triangle since the X axis value and, therefore, the area of the tested triangle, is at a
minimum. It is possible with some random data sets that this line, although nearly vertical,
might include a couple of points under the line. The placement of the line at this position is
essentially an artifact of the procedure which results from an inability to find any other line
which enriches the density in the tested triangle.

Because random numbers are not "real" metrics, an example of a "real molecular
measurement" that is unlikely to be a valid diversity metric was examined. For the Patterson
plot of Figure 5, a force field strain energy (for the topomeric conformations using the
standard Tripos force field) was calculated for each of the compounds in the same data set as
was used for Figures 3 and 4. Because force field strain energy tends to increase with the
number of atoms and thus correlate roughly with the occasionally useful molecular weight,
to normalize the value, the force field energy was divided by the number of atoms in each
molecule. As expected, just as with random numbers, no optimum line could be found. This
is essentially a confirmation that the points in the graph were also distributed randomly. Again,
the best ratio is not significantly different from 1.0.

To objectively quantify the validity/usefulness determination, the ratio of the density
of points in the lower right trapezoid to the average density of points is determined. This value
can vary from somewhere above 0 but significantly less than 1, through 1 (equal density of
points in each area) to a maximum of 2 (all the points in the lower right trapezoid, and the
upper triangle and lower trapezoid are equal in area [limiting case of trapezoid merging into
triangle]). According to the theoretical considerations discussed above, a ratio very near or
equal to 1 (approximately equal densities) would indicate an invalid metric, while a ratio
(significantly) greater than 1 would indicate a valid metric. The value of this ratio is set forth
next to each Patterson plot in Figures 3 (real data), 4 (random numbers substituted), and 5
(force field energy substituted) under the column "Density Ratio". Clearly, the topomeric
CoMFA data of Figure 3 reflect a valid metric (ratio much larger than 1), while the random
numbers of Figure 4 and force field energies of Figure 5 reflect a meaningless invalid metric
(ratio very near 1). As will be discussed below, a density ratio of 1.1 is a useful threshold of
validity/usefulness for a molecular diversity descriptor.
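
The density ratio just defined can be illustrated with a short Python sketch. It is a sketch under stated assumptions rather than the Appendix "A" code: the names `density_ratio` and `is_valid_metric` are hypothetical, the plot is assumed to span [0, max. metric value] by [0, max. bio. value], points on the line are counted as belonging to the trapezoid, and the 1.1 threshold is applied as a strict inequality.

```python
def density_ratio(points, x_line):
    """Density of points in the lower right trapezoid divided by the average
    density of the whole plot, for the line from [0, 0] to [x_line, max bio]."""
    max_x = max(x for x, _ in points)
    max_bio = max(y for _, y in points)
    total_area = max_x * max_bio
    # The upper left triangle has vertices [0, 0], [0, max bio], [x_line, max bio];
    # the trapezoid is whatever remains of the plot.
    trapezoid_area = total_area - 0.5 * x_line * max_bio
    # Points on or under the line (cross-multiplied form of y <= (max_bio / x_line) * x).
    in_trapezoid = sum(1 for x, y in points if y * x_line <= max_bio * x)
    return (in_trapezoid / trapezoid_area) / (len(points) / total_area)

def is_valid_metric(ratio, threshold=1.1):
    """Apply the 1.1 density ratio threshold (strict comparison is an assumption)."""
    return ratio > threshold
```

When the line is drawn to the largest metric value and every point falls under it, the trapezoid is half of the plot and the function returns the limiting value of 2.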

The statistical significance of the Patterson plot data can also be determined by a chi-
squared test at any chosen level of significance. In this case the data are handled as:

$$\chi^2 = \frac{(\text{Actual LRT Count} - \text{Expected LRT Count})^2}{\text{Expected LRT Count}}$$

where:

$$\text{Expected LRT Count} = \frac{\text{LRT Area}}{\text{Total Area}} \times \text{Total Count}$$

The chi-squared values for the Patterson plots of Figures 3, 4, and 5 are also set forth next to
the plots under the column χ². For 95% confidence limits and one degree of freedom, the chi-
squared value is 3.84. The chi-squared values confirm the visual inspection and density ratio
observations that the CoMFA metric is valid and the other two "constructed" metrics are
invalid. A full set of topomeric CoMFA, random number, and force field data are discussed
below under validation of the topomeric CoMFA descriptor.
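
As a worked illustration of the chi-squared test, the following Python sketch computes the statistic from counts and areas. The function name `lrt_chi_squared` and the constant name are hypothetical, and "LRT" is read here as the lower right trapezoid used for the density ratio, which is an assumption about the abbreviation.

```python
# Critical value quoted in the text for 95% confidence and one degree of freedom.
CHI2_CRITICAL_95_1DF = 3.84

def lrt_chi_squared(actual_lrt_count, lrt_area, total_area, total_count):
    """Chi-squared statistic comparing the observed LRT count with the count
    expected if points were spread uniformly over the whole plot."""
    expected = (lrt_area / total_area) * total_count
    return (actual_lrt_count - expected) ** 2 / expected
```

For example, 90 of 100 points falling in a region covering half the plot gives an expected count of 50 and a statistic of 32, well above 3.84; 5 of 6 points in a region covering three quarters of the plot gives about 0.06, which is not significant.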

The analysis of metrics using the difference plot of this invention is a powerful tool
with which to examine metrics and data sets. First, the analysis can be used with any system
and requires no prior assumptions about the range of activities or structures which need to be
considered. Second, the plot extracts all the information available from a given data set since
pairwise differences between all molecules are used. The prior art believed that not much
information, if any, could be extracted from literature data sets since, generally, there is not
a great deal of structural variety in each set. On the contrary, as will be shown below, using
the Patterson plot method of this invention, a metric can be validated based on just such a
limited data set. As will also be demonstrated below, metrics can be applied to literature data
sets to determine the validity of the metrics. This ability opens up vast amounts of pre-existing
literature data for analysis. Since in any analysis there is always a risk of making an improper
determination due to sampling error when too few data sets are used or too narrow a variety
of biological systems (activities) are included, the ability to use much of the available literature
is a significant advance in the art. Also, the fact that the validation analysis methodology of
this invention is not dependent on the study of a specific biological system strongly implies
that a validated metric is very likely to be applicable to molecular structures of unknown
biological activity encountered in designing combinatorial screening libraries or making other
diversity based selections. Or stated slightly differently, there is a high degree of confidence
that metrics validated across many chemistries and biologies can be used in situations where
nothing is known about the biological system under study.
4. Topomeric CoMFA Descriptor 

Many of the prior art descriptors are essentially 2D in nature. That this is the case with
the prior art probably reflects three underlying reasons. First, the rough general associations
between fragments and biological properties were validated statistically decades ago. Second,
2D fragment keys or "fingerprints" are widely available since they are used by all commercial
molecular database programs to compare structures and expedite retrieval. Third, no one in
the prior art has yet met the challenge of figuring out how to formulate and validate an
appropriate three dimensional molecular structural descriptor. The situation in the prior art
before the present invention is very similar to the field of QSAR about ten years ago. Then,
the prior art had long recognized the desirability of three dimensional descriptors but had not
been able to implement any. When a 3D technique (CoMFA) became available, its widespread
acceptance and application confirmed the expected importance of 3D descriptors in general.

It has been discovered that a CoMFA approach to generating a molecular structural
descriptor using a specially developed alignment procedure, topomeric alignment, produces a
three dimensional descriptor of molecules which is shown to be valid by the method outlined
above. In addition, this new descriptor provides a powerful tool with which to design
combinatorial screening libraries. It is equally useful any time selection based on diversity
from within a congeneric series is required. A full description of CoMFA and the generation
of molecular interaction energies is contained in U.S. Patents 5,025,388 and 5,307,287; the
disclosures of these patents are incorporated in this Application. The usual challenge in
applying CoMFA to a known set of molecules is to determine the proper alignment of the
molecular structures with respect to each other. Two molecules of identical structure will have
substantially different molecular interaction energies if they are translated or rotated so as to
move their atoms more than about 4 Å from their original positions. Thus, alignment is hard
enough when applying CoMFA to analyze a set of molecules which interact with the same
biological receptor. The more difficult question is how to "align" molecules distributed in
multidimensional chemistry space to create a meaningful descriptor with respect to arbitrary
and unknown receptors against which the molecules will ultimately be tested. The topomeric
alignment procedure was developed to correct the usual CoMFA alignments which often
over-emphasize a search for "receptor-bound", "minimum energy", or "field-fit" conformations. It
has been discovered that, when congenericity exists, a meaningful alignment results from
overlaying the atoms that lie within some selected common substructure and arranging the
other atoms according to a unique canonical rule with any resulting steric collisions ignored.
When CoMFA fields are generated for molecules so aligned, it has been discovered that the
resulting field differences are a valid molecular structural descriptor.

Two major advantages are achieved by applying the topomeric CoMFA metric to the
reactants proposed for use in a combinatorial synthesis rather than the products resulting from
the synthesis. First, the computational time/effort is dramatically reduced. Instead of analyzing
for diversity a combinatorial matrix of product compounds (R1 x R2 x R3 ...) only the
values for the sum of the reactants (R1 + R2 + R3 ...) need to be computed. For example,
assuming 2000 reactants for R1 and 2000 reactants for R2, only 4000 calculations need be
performed on the reactants versus 2000² (4,000,000) if calculations on the combinatorial
products were performed. Second, by identifying reactants which explore similar diversity
space, it is only necessary to choose one of each reactant representative of each diversity. This
immediately reduces the number of combinatorial products which need to be considered and
synthesized.

A. Topomeric Alignment

Usually a CoMFA modeler seeks low energy conformations. However, if alignment
with unknown receptors is desired (such as is the case in designing combinatorial screening
libraries for general purpose screening), then the major goal in conformer generation must be
that molecules having similar topologies should produce similar fields. In fact, topomeric
CoMFA fields may be used as a validated diversity descriptor to identify molecules with
similar or dissimilar structures anytime there is a problem of having more compounds than can
be easily dealt with. Thus, its applicability extends well beyond its use in combinatorial
chemistry to all situations where it is necessary to analyze an existing group of compounds or
specify the creation of new ones. The topomeric alignment procedure is especially applicable
to the design of a combinatorial screening library. Typically, as noted earlier, in the creation
of combinatorially derived compounds there is often an invariant central core to which a
variety of side chains (contributed by reactants of a particular class) are attached at the open
valences. Within the combinatorial products, this central core tethers each of the side chains
contributed by any set of reactants into the same relative position in space. In the language of
CoMFA alignments, the side chains contributed by each reactant can thus be oriented by
overlapping the bond that attaches the side chain to the central core and using a topomeric
protocol to select a representative conformation of the side chain. Nowhere does the prior art
suggest that a topomeric protocol could possibly yield a meaningful alignment. Indeed, the
prior art inherently teaches away from the idea because the topomerically derived conformers
often may be energetically inaccessible and incapable of binding to any receptor.

The idea of a topomeric conformer is that it is rule based. The exact rules may be
modified for specific circumstances. In fact, once it is appreciated from the teaching of this
invention that a particular topomeric protocol is useful (yields a valid molecular descriptor),
other such protocols may be designed and their use is considered within the teaching of this
disclosure.

i. General Topomeric Alignment
With the exception of two specialized situations (molecules containing chiral atoms or
requiring a choice between two equivalent atoms) which will be discussed in section 4(A)(ii)
below, the following topologically-based rules will generate a single, consistent, unambiguous,
aligned topomeric conformation for any molecule. The software necessary to implement this
procedure is contained in Appendix "A". The starting point for a topomeric alignment of a
molecule is a CONCORD generated three dimensional model which is then FIT as a rigid body
onto a template 3D model by least-squares minimization of the distances between structurally
corresponding atoms. By convention, the template model is originally oriented so that one of
its atoms is at the Cartesian origin, a second lies along the X axis, and a third lies in the XY
plane.
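
The orientation convention for the template model can be illustrated with a small Python sketch. It only shows one way to place three chosen atoms at the origin, on the X axis, and in the XY plane; it does not perform the least-squares FIT step. The function name `orient_template` and the helper names are hypothetical, and the three atoms are assumed not to be collinear.

```python
import math

def _sub(p, q):
    return tuple(a - b for a, b in zip(p, q))

def _dot(p, q):
    return sum(a * b for a, b in zip(p, q))

def _cross(p, q):
    return (p[1] * q[2] - p[2] * q[1],
            p[2] * q[0] - p[0] * q[2],
            p[0] * q[1] - p[1] * q[0])

def _unit(p):
    norm = math.sqrt(_dot(p, p))
    return tuple(a / norm for a in p)

def orient_template(coords, i, j, k):
    """Return coordinates re-expressed so that atom i is at the origin,
    atom j lies on the +X axis, and atom k lies in the XY plane."""
    origin = coords[i]
    ex = _unit(_sub(coords[j], origin))
    v = _sub(coords[k], origin)
    # Remove the component of v along ex to obtain the in-plane Y direction.
    ey = _unit(tuple(a - _dot(v, ex) * b for a, b in zip(v, ex)))
    ez = _cross(ex, ey)
    return [tuple(_dot(_sub(p, origin), axis) for axis in (ex, ey, ez))
            for p in coords]
```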

Torsions are then adjusted for all bonds which: 1) are single and acyclic; 2) connect
polyvalent atoms; and 3) do not connect atoms that are polyvalent within the template model
structure since adjusting such bonds would change the template-matching geometry.
Unambiguous specification of a torsion angle about a bond also requires a direction along that
bond and two attached atoms. In this situation, for acyclic bonds the direction "away from the
FIT atoms" is always well-defined.

The following precedence rules then determine the two attached atoms. From each
candidate atom, begin growing a "path", atom layer by atom layer, including all branches but
ending whenever another path is encountered (occurrence of ring closure). At the end of the
bond that is closer to the FIT atoms, choose the attached atom beginning the shortest path to
any FIT atom. If there are several ways to choose the atom, first choose the atom with the
lowest X. If there are still several ways to choose the atom, choose next the atom with the
lowest Y, and finally, if necessary, the lowest Z coordinate (coordinate values differing by
some small value, typically less than 0.1 Angstroms, are considered as identical). At the other
end of the bond, choose the atom beginning the path that contains any ring. When more than
one path contains a ring, choose the atom whose path has the most atoms. If there are several
ways to choose the path, in precedence order choose the path with the highest sum of atomic
weights, and finally, if still necessary, the atom with the highest X, then highest Y, then
highest Z coordinate. The new setting of the torsional value depends only on whether the
bonds to the chosen atoms are cyclic or not. If neither is cyclic, the setting is 180 degrees;
if one is cyclic, the setting is 90 degrees; and if both are cyclic, the setting is 60 degrees. Any
steric clashes that may result from these settings are ignored.

As an illustrative example, consider generation of the topomeric conformer for the side
chain shown in Figure 6(A), in which atom 1 is attached to some core structure by the upper
left-most bond. Assuming that the alignment template for this fragment involves atom 1 only,
there are three bonds whose torsions require adjustment, those connecting atom pairs 1 - 3;
5 - 8; and 10 - 14. (Adding atom 3 to the alignment template would make atom 1 "polyvalent
within the template model structure", so that the 1 - 3 bond would then not be altered.) The
atom whose attached atoms will move (in the torsion adjustment) is the second atom named in
each atom pair. For example, if a torsional change were applied to the 14 - 10 bond instead
of the 10 - 14 bond as shown in Figure 6(A), all of the molecule except atoms 10, 14 and 15
(and 13 by symmetry) would move. Correspondingly, if a torsional change were applied to the
10 - 14 bond instead of the 14 - 10 bond, only atom 15 would move.

To define a torsional change, atoms attached to each of the bonded atoms must also be
specified. For example, setting torsion about the bond 5 - 8 to 60 degrees would yield four
different conformers depending on whether it is the 6-5-8-13, 6-5-8-9, 4-5-8-9, or 4-5-8-13
dihedral angle which becomes 60 degrees. To make such a choice, "paths" are grown from
each of the candidate atoms, in "layers", each layer consisting of all previously unvisited atoms
attached to any existing atom in any path. In choosing among the four attached-atom
possibilities of the 5 - 8 bond, Figure 6(B) shows the four paths after the first layer of each
is grown, and Figure 6(C) shows the final paths. In Figure 6(C), notice within the rings that,
not only is the bond between 3 and 7 not crossed, but also atom 11 is not visited because the
third layer seeks to include 11 from two paths, so both fail. The attached atoms chosen for
the torsion definition become the ones that begin the highest-ranking paths according to the
rules stated above. For example, in Figure 6(C), attached atom 4 outranks atom 6 because its
path is the only one reaching the alignment template, and atom 9 outranks atom 13 because
its path has more atoms, so that it is the 4-5-8-9 torsion which is set to a prescribed value.
For the same reasons, the other complete torsions become 9-10-14-15, attached-1-3-4 and
attached-1-2-16. The other decision rules would need to be applied if atom 9 was, instead of
carbon, an aromatic nitrogen (with the consequent loss of the attached hydrogen) so that the
9 and 13 paths have the same number of atoms. In this case, the 9 path still takes priority,
since it has the higher molecular weight. If instead atom 14 is deleted, so that the 9 and 13
paths are topologically identical, the 9 path again takes priority because atom 9 has the same
X coordinate but a larger Y coordinate than does atom 13.
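
The layer-by-layer path growth described above can be sketched as follows. This is an illustrative reading of the rule, not the Appendix "A" implementation: the function name `grow_paths` and its arguments are hypothetical, and it is assumed that an atom reached by two different paths in the same layer is assigned to neither path and is not entered later.

```python
def grow_paths(bonds, starts, blocked):
    """Grow one path from each start atom, layer by layer.

    bonds   -- dict mapping each atom to the set of atoms bonded to it
    starts  -- candidate attached atoms, one path per atom
    blocked -- atoms the paths may not enter (e.g. the two atoms of the bond)
    Returns a dict mapping each start atom to the set of atoms in its path.
    """
    owner = {s: s for s in starts}
    visited = set(starts) | set(blocked)
    frontier = {s: {s} for s in starts}
    while any(frontier.values()):
        # Collect, for every unvisited neighbour, the paths trying to claim it.
        claims = {}
        for start, atoms in frontier.items():
            for atom in atoms:
                for neighbour in bonds[atom]:
                    if neighbour not in visited:
                        claims.setdefault(neighbour, set()).add(start)
        frontier = {s: set() for s in starts}
        for atom, claimants in claims.items():
            visited.add(atom)
            if len(claimants) == 1:  # contested atoms join no path
                start = next(iter(claimants))
                owner[atom] = start
                frontier[start].add(atom)
    return {s: {a for a, o in owner.items() if o == s} for s in starts}
```

The resulting paths can then be ranked by the precedence rules (reaching a FIT atom, containing a ring, atom count, sum of atomic weights, and finally coordinates).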

As for the dihedral angle values themselves, torsion 4-5-8-9 is set to 60 degrees,
because both the 4-5 and 8-9 bonds are within a ring; torsions 9-10-14-15 and attached-1-3-4
become 90°, because only the 3-4 and 9-10 bonds respectively are cyclic; and the attached-1-
2-16 dihedral becomes 180° since none of the bonds are cyclic. It should be noted that this
topomeric alignment procedure will not work with molecules containing chiral centers since,
for each chiral center, two possible three dimensional configurations are possible for the same
molecule, and, clearly, each configuration by the above rules would yield a different topomeric
conformer.
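
The rule for the torsion value itself reduces to a lookup on how many of the two bonds to the chosen attached atoms are cyclic. A minimal Python sketch, with the hypothetical function name `torsion_setting`:

```python
def torsion_setting(first_bond_cyclic, second_bond_cyclic):
    """Return the prescribed torsion in degrees: 180 if neither bond to the
    chosen attached atoms is cyclic, 90 if one is, and 60 if both are."""
    n_cyclic = int(bool(first_bond_cyclic)) + int(bool(second_bond_cyclic))
    return {0: 180.0, 1: 90.0, 2: 60.0}[n_cyclic]
```

Applied to the example, `torsion_setting(True, True)` reproduces the 60 degree setting for 4-5-8-9, and `torsion_setting(True, False)` the 90 degree setting for 9-10-14-15.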

ii. Specialized Alignment for Chiral and Equivalent Atoms
In order to resolve the ambiguity introduced by a chiral center or centers in a molecule,
a specialized topomeric alignment rule must be adopted. Figure 6(D) shows a side chain whose
attachment atom is marked as "Root" and in which atom I is chiral. Atom I has four
non-equivalent attachments, indicated by Root, J, K, and L. Although the absolute
configuration of such a chiral atom is not usually specified, an alignment methodology of an
explicit 3D model must necessarily consistently select one of the two possible conformations,
even if arbitrarily chosen. Proceeding as taught above, generating the topomeric conformation
for the side chain leads to selection of atom J (the largest of the attachments rooted by J, K,
and L) as the atom defining the Root-I torsion and thus fixes the position of J. However, the
relative positions of K and L remain ambiguous. Unless such "prochiral" atoms (including
pyramidally hybridized nitrogen) are recognized and a configuration explicitly assigned, side
chains which are topologically identical may seem to be very different in shape.

The procedure used to make sure that the actual topomeric 3D models generated around
chiral centers are as similar as possible is as follows: first, form a list of all such chiral centers
including pyramidal nitrogen (many algorithms for doing this are described in the literature and
are found in any modelling software); second, after an individual torsion has been set, as
described earlier, if the third atom of the four in the torsion list is one of the chiral centers
[in Figure 6(D) the configuration of atom I will be adjusted just after the torsion about Root-I
has been set], proceed to replace the fourth atom on the torsion list [J in Figure 6(D)] with the
next highest attachment atom [following the earlier description this will be atom K in Figure
6(D)]. If the dihedral angle value for the new torsion is greater than 180 degrees, then the
relative position of atoms K and L must be exchanged. To exchange the positions of atoms K
and L, generate the plane defined by the second (Root) through fourth (J) atoms on the torsion
that was initially set. Finally, reflect the coordinates of all the atoms attached to the third atom
(I) through that plane. This topomeric procedure will generate a consistent topomeric
alignment for all side chains containing chiral centers.

A second specialized topomeric alignment problem which may be encountered is the
requirement to select between two equivalent atoms. This situation is also illustrated in Figure
6(D) where there are two candidate attachment atoms, "A" and "a", for the torsion
A(a)-B-C-D. Topologically atoms "A" and "a" are identical, but a different position for the
five-membered ring, hence a very different shape, will be generated depending on whether "A"
or "a" is used to assign the torsion of A(a)-B-C-D. The following rule is used to ensure that
the choice between "A" and "a" is made consistently. Measure the two dihedral angles defined
by the atom lists Root-B-C-A and Root-B-C-a. (Although these atoms are obviously not
directly connected, the dihedral angle values are well-defined.) Of the two possibilities, select
the atom to define the torsion for which the torsional value lies between 170 and 350 degrees.
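
The two geometric operations used in this section, measuring a dihedral angle on a 0 to 360 degree scale and reflecting atoms through a plane, can be sketched in Python as follows. The function names are hypothetical, and the sign convention of the dihedral (computed here with the common atan2 formulation) is an assumption, since the text does not state which rotational sense is positive.

```python
import math

def _sub(p, q):
    return tuple(a - b for a, b in zip(p, q))

def _dot(p, q):
    return sum(a * b for a, b in zip(p, q))

def _cross(p, q):
    return (p[1] * q[2] - p[2] * q[1],
            p[2] * q[0] - p[0] * q[2],
            p[0] * q[1] - p[1] * q[0])

def dihedral(p0, p1, p2, p3):
    """Dihedral angle p0-p1-p2-p3 in degrees, mapped to the range [0, 360)."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    b2_len = math.sqrt(_dot(b2, b2))
    m1 = _cross(n1, tuple(c / b2_len for c in b2))
    return math.degrees(math.atan2(_dot(m1, n2), _dot(n1, n2))) % 360.0

def reflect_through_plane(point, a, b, c):
    """Mirror a point through the plane containing points a, b and c."""
    normal = _cross(_sub(b, a), _sub(c, a))
    scale = 2.0 * _dot(_sub(point, a), normal) / _dot(normal, normal)
    return tuple(p - scale * n for p, n in zip(point, normal))

def choose_equivalent(root, b, c, candidates):
    """Return the name of the candidate X whose Root-B-C-X dihedral lies
    between 170 and 350 degrees; candidates maps names to coordinates."""
    for name, xyz in candidates.items():
        if 170.0 <= dihedral(root, b, c, xyz) <= 350.0:
            return name
    return None
```

For the chiral case, the same `dihedral` function would be evaluated for Root-I with K as the fourth atom, and if the value exceeds 180 degrees every atom attached to I would be passed through `reflect_through_plane` using the Root, I, J plane.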

Using the selection rules set out above, the critical point is that the use of a single
topomerically aligned conformer in computing a CoMFA three dimensional descriptor has been
found to yield a validated descriptor. While other approaches to conformer selection such as
averaging many representative conformers or classifying a representative set by their possible
interactions with a theoretically averaged receptor (such as in the polyomino docking) are
possible, it has been found that topomerically aligned conformers yield a validated descriptor
which, as will be seen below, produces clustering highly consistent with the accumulated
wisdom of medicinal chemistry.

B. Calculation Of CoMFA and Hydrogen Bonding Fields

The basic CoMFA methodology provides for the calculation of both steric and
electrostatic fields. It has been found up to the present point in time that using only the steric
fields yields a better diversity descriptor than a combination of steric and electrostatic fields.
There appear to be three factors responsible for this observation. First is the fact that steric
interactions - classical bioisosterism - are certainly the best defined and probably the most
important of the selective non-covalent interactions responsible for biological activity. Second,
adding the electrostatic interaction energies may not add much more information since the
differences in electrostatic fields are not independent of the differences in steric fields. Third,
the addition of the electrostatic fields will halve the contribution of the steric field to the
differences between one shape and another. This will dilute out the steric contribution and also
dilute the neighborhood property. Clearly, reducing the importance of a primary descriptor is
not a way to increase accuracy. However, it is certainly possible that in a given special
situation the electrostatic contribution might contribute significantly to the overall "shape".
Under these unique circumstances, it would be appropriate to also use the electrostatic
interaction energies or other molecular characterizers, and such are considered within the scope
of this disclosure. For instance, in some circumstances a topomeric CoMFA field which
incorporates hydrogen bonding interactions, characterized as set forth below, may be useful.

The steric fields of the topomerically aligned molecular side chain reactants are
generated almost exactly as in a standard CoMFA analysis using an sp³ carbon atom as the
probe. As in standard CoMFA, both the grid spacing and the size of the lattice space for which
data points are calculated will depend on the size of the molecule and the resolution desired.
The steric fields are set at a cutoff value (maximum value) as in standard CoMFA for lattice
points whose total steric interaction with any side-chain atom(s) is greater than the cutoff
value. One difference from the usual CoMFA procedure is that atoms which are separated
from any template-matching atom by one or more rotatable bonds are set to make reduced
contributions to the overall steric field. An attenuation factor (1 - "small number"), preferably
about 0.85, is applied to the steric field contributions which result from these atoms. For atoms
at the end of a long molecule, the attenuation factor produces very small field contributions
(i.e., [0.85]^N) where N is the number of rotatable bonds between the specified atom and the
alignment template atom. This attenuation factor is applied in recognition of the fact that the
rotation of the atoms provides for a flexibility of the molecule which permits the parts of the
molecule furthest away from the point of attachment to assume whatever orientation may be
imposed by the unknown receptor. If such atoms were weighted equally, the contributions to
the fields of the significant steric differences due to the more anchored atoms (whose
disposition in the volume defined by the receptor site is most critical) would be overshadowed
by the effects of these flexible atoms.
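
The attenuation of steric contributions with the number of intervening rotatable bonds can be illustrated with a few lines of Python. The function names are hypothetical, and the default of 0.85 is simply the preferred value quoted above.

```python
def attenuation(n_rotatable_bonds, factor=0.85):
    """Weight applied to an atom separated from the alignment template atom
    by n_rotatable_bonds rotatable bonds: factor ** N."""
    return factor ** n_rotatable_bonds

def attenuated_contribution(raw_field_value, n_rotatable_bonds, factor=0.85):
    """Scale one atom's steric field contribution by its attenuation weight."""
    return raw_field_value * attenuation(n_rotatable_bonds, factor)
```

An atom with no intervening rotatable bonds keeps its full contribution, an atom two rotatable bonds away is weighted by 0.85² = 0.7225, and an atom ten rotatable bonds away contributes less than a fifth of its unattenuated value.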

The derivation of a hydrogen-bond field is slightly different from the standard CoMFA
measurement. The intent of the hydrogen-bonding descriptor is to characterize similarities and
differences in the abilities of side chains to form hydrogen-bonds with unknown receptors.
Like the successful use of the topomeric conformation to characterize steric interactions, the
topomeric conformation is also an appropriate way to characterize the spatial position of a side
chain's hydrogen-bonding groups. However, unlike a steric field, hydrogen-bonding is a
spatially localized phenomenon whose strength is also difficult to quantitate. Therefore, it is
appropriate to represent a hydrogen-bonding field as a bitset, much like a 2D fingerprint, or
as an array of 0 or 1 values rather than as an array of real numbers like a CoMFA field.

The hydrogen-bonding loci for a particular side chain are specified using the DISCO
approach of "extension points" developed by Y. Martin and coworkers, wherein, for
example, a carbonyl oxygen generates two hydrogen-bond accepting loci at positions found by
extending a line passing from the oxygen nucleus through each of the two "lone-pair" locations
to where a complementary hydrogen-bond donating atom on the receptor would optimally be.
It is not possible with a bitset representation to attenuate the effects of atoms by the number
of intervening rotatable bonds. Instead, uncertainty about the location of a hydrogen-bonding
group can be represented by setting additional bits for grid locations spatially adjacent to the
single grid location that is initially set for each hydrogen-bonding locus. In other words, each
hydrogen-bonding locus sets bits corresponding to a cube of grid points rather than a single
grid point. The validation results shown in Table 4 were obtained for a cube of 27 grid
locations for each hydrogen bonding locus. The single bitset representing a topomeric
hydrogen-bonding fingerprint has twice as many bits as there are lattice points, in order to
discriminate hydrogen-bond accepting and hydrogen bond-donating loci. The difference
between two topomeric hydrogen-bonding fingerprints is simply their Tanimoto coefficient
which now represents a difference in actual field values. Software which implements the
hydrogen-bonding field calculations is provided in Appendix "B".
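
The bitset representation can be sketched in Python using a set of occupied (kind, grid index) entries in place of a fixed-length bit array. This is an illustration under stated assumptions and not the Appendix "B" software: the names `hbond_fingerprint` and `tanimoto` are hypothetical, loci are assumed to be given already as integer grid indices, lattice boundaries are ignored, and two empty fingerprints are arbitrarily given a coefficient of 1.

```python
from itertools import product

def hbond_fingerprint(loci):
    """Build a fingerprint from loci given as (ix, iy, iz, kind) tuples, where
    kind is 'acceptor' or 'donor'. Each locus sets its own grid point and the
    26 adjacent points (a 3 x 3 x 3 cube of 27 grid locations)."""
    bits = set()
    for ix, iy, iz, kind in loci:
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            # Tagging each bit with its kind keeps acceptor and donor bits apart,
            # which corresponds to doubling the number of bits per lattice point.
            bits.add((kind, ix + dx, iy + dy, iz + dz))
    return bits

def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient: bits set in both divided by bits set in either."""
    union = fp_a | fp_b
    if not union:
        return 1.0  # assumption: two empty fingerprints are treated as identical
    return len(fp_a & fp_b) / len(union)
```

Two acceptor loci one grid step apart share 18 of their 27 bits each, giving a coefficient of 18/36 = 0.5, while an acceptor and a donor at the same grid point share no bits.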
C. Validation Of Topomeric CoMFA Descriptor

The validity of topomerically aligned CoMFA fields as a molecular structural
descriptor, which can be used to describe the diversity of compounds, was confirmed on
twenty data sets randomly chosen from the recent biochemical literature. The data sets spanned
several different types of ligand-receptor binding interactions. The only criteria for the data
sets were: 1) the reported biological activities must span at least two orders of magnitude; 2)
the structural variation must be "monovalent" (only one difference per molecule); 3) the
molecules contain no chiral centers; and 4) no page turning was required for data entry in
order to reduce the likelihood of entry errors. Each data set was analyzed independently. The
identification of the data sets is set forth in Appendix "C". The structural variations of the side
chains of the core templates were entered as the Sybyl Line Notations of the corresponding
thiols. (Sybyl Line Notations [SLNs] define molecular structures.) An -SH was substituted for
the larger common template portion of each molecule and provided the two additional atoms
needed for 3D orientation. According to the validation method of this invention the Patterson
plots constructed as discussed above for the twenty data sets are shown in Figures 7(a) - 7(t).

In 17 of the 20 cases, visual inspection of the plots suggests that the density of points
in the lower right trapezoid is, indeed, greater than the density in the upper left triangle as
predicted for a metric descriptor obeying the neighborhood rule. Also, for reasons noted
earlier, some points do fall above the line as would be expected for the real world. However,
the relative rarity of points in the upper left triangle of the plots indicates that "small steric
field differences are not likely to produce large differences in bioactivity", the neighborhood
rule. Thus, the distribution of points in the Patterson plots across all the randomly selected
data sets is remarkably consistent with the theoretical prediction for a valid/useful diversity
metric. It can be easily seen that the topomeric CoMFA metric is validated/useful.
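
The comparison of point densities on either side of the line can be sketched as follows. This is
an illustration under stated assumptions, not the procedure of this disclosure: the construction
of the dividing line is set out earlier and is not reproduced here, so the sketch simply takes
the slope of the line as a parameter. Reading "density" as points per unit area inside the
bounding rectangle of the plot is likewise an assumption, and all function names are hypothetical.

```python
from itertools import combinations


def patterson_points(distance, activities):
    """Return (descriptor distance, |activity difference|) for every molecular pair.

    distance   -- function (i, j) -> descriptor distance between molecules i and j
    activities -- list of biological activities (for example, log potencies)
    """
    return [(distance(i, j), abs(activities[i] - activities[j]))
            for i, j in combinations(range(len(activities)), 2)]


def density_ratio(points, slope):
    """Density of points on or below the line y = slope * x over the density above it.

    Assumption: densities are counts per unit area within the rectangle
    [0, x_max] x [0, y_max] spanned by the plotted points.
    """
    if slope <= 0:
        raise ValueError("slope must be positive")
    x_max = max(x for x, _ in points)
    y_max = max(y for _, y in points)
    # x value at which the line leaves the rectangle through its top edge
    # (or the right edge, if it never reaches the top).
    x_top = min(x_max, y_max / slope)
    area_below = 0.5 * slope * x_top ** 2 + (x_max - x_top) * y_max
    area_above = x_max * y_max - area_below
    n_above = sum(1 for x, y in points if y > slope * x)
    n_below = len(points) - n_above
    if n_above == 0 or area_above == 0:
        return float("inf")
    return (n_below / area_below) / (n_above / area_above)
```

With these definitions a descriptor unrelated to activity would be expected to give a ratio near
1, while a descriptor obeying the neighborhood rule would give a ratio above 1.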

Table 1 contains the density ratios from the quantitative analysis of the twenty data sets.
The density ratios of the two test metrics (random number assignments and molecular force
field energy divided by number of atoms for the diversity descriptor values) described earlier
are presented for comparison. χ² values reflecting the statistical significance of the ratios are
also set forth next to the corresponding ratios.

TABLE 1
Patterson Plot Ratios and Associated χ²

No.  Reference        CoMFA    CoMFA    Random   Random   Energy   Energy
                      Ratio    χ²       Ratio    χ²       Ratio    χ²

 1   Uehling          1.71     10.27    0.98     0.01     0.98     0.02
 2   Strupczewski     1.39     57.33    1.01     0.02     0.97     0.47
 3   Siddiqi          1.44      6.26    0.92     0.01     *
 4   Garratt-1        1.72     13.01    1.02     0.02     1.00     0.00
 5   Garratt-2        1.37      8.02    1.04     0.11     0.97     0.07
 6   Heyl             1.04      0.08    0.99     0.01     0.97     0.05
 7   Cristalli        1.40     51.21    1.00     0.00     0.96     0.46
 8   Stevenson        0.95      0.02    0.98     0.00     0.98     0.01
 9   Doherty          1.63      3.54    1.02     0.01     0.96     0.02
10   Penning          1.45     10.33    0.99     0.01     1.00     0.00
11   Lewis            0.95      0.04    1.05     0.05     0.97     0.02
12   Krystek          1.64    119.92    1.00     0.00     0.97     0.49
13   Yokoyama-1       1.18      1.88    1.00     0.00     0.93     0.41
14   Yokoyama-2       1.23      2.62    1.02     0.02     0.99     0.01
15   Svensson         1.27      3.72    1.04     0.00     0.99     0.00
16   Tsutsumi         1.38      6.50    0.94     0.02     0.96     0.06
17   Chang            1.34     45.55    1.01     0.12     0.99     0.03
18   Rosowsky         1.71     12.46    0.95     0.10     1.00     0.00
19   Thompson         1.47      3.96    1.06     0.09     1.00     0.00
20   Depreux          1.22     10.85    0.98     0.07     *

     MEAN             1.38     18.38    1.00     0.03     0.98     0.12
     STND. DEVIATION  0.24     29.43    0.04     0.04     0.02     0.19

* Data sets 3 and 20 are not reported for the force field energy because one of the
structures in each data set (in the topomeric conformation) had a very strained energy
greater than 10 kcal/mole-atom, which produced a discontinuously large metric
difference.

The chi-squared distributions for 1 degree of freedom are:

P  =   .75    .90    .95    .99    .999
χ² =   1.32   2.71   3.84   6.64   10.83

Typically, a confidence level of 95% is considered appropriate in statistical measures.

A metric is considered valid/useful for an individual data set if the Patterson plot ratio
is greater than 1.1; that is, there is greater than a 10% difference in the density between the
ULT and LRT. The use of 1.1 as a decisional criterion is confirmed by an examination of the
scatter diagrams of χ² values versus their corresponding ratios as shown in Figures 8A and
8B. (The value of √χ² is actually plotted in Figure 8B in order to separate the data points.)
Figure 8A shows the plot of χ²s having a value of greater than 3.84 (95% confidence limits)
versus their corresponding ratios, while Figure 8B shows the plot of χ²s (plotted as √χ²)
having a value less than 3.84 versus their corresponding ratios. A ratio value of greater than
1.1 (Figure 8A) clearly includes most of the statistically significant ratios, while a ratio value
of less than 1.1 clearly includes most of the statistically insignificant ratios. While this is not
a perfect dividing point and there is some overlap, there is also some distortion of the χ²
values due to limited population sizes as discussed below. Overall, the value of 1.1 provides
a reasonable decision point.
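
The two criteria just described, a ratio above 1.1 and a χ² at or above the 95% critical value
for 1 degree of freedom, can be combined in a short sketch. The critical values are those
tabulated above; the function names and the choice to return the two verdicts separately are
assumptions made for illustration only.

```python
# Critical chi-squared values for 1 degree of freedom, as tabulated in the text.
CHI2_CRITICAL_1DF = [(0.75, 1.32), (0.90, 2.71), (0.95, 3.84),
                     (0.99, 6.64), (0.999, 10.83)]


def confidence_level(chi2):
    """Highest tabulated confidence level whose critical value chi2 reaches, else None."""
    level = None
    for p, critical in CHI2_CRITICAL_1DF:
        if chi2 >= critical:
            level = p
    return level


def assess(ratio, chi2, ratio_cutoff=1.1, required_level=0.95):
    """Return (ratio exceeds the cutoff, chi2 meets the required confidence level)."""
    level = confidence_level(chi2)
    significant = level is not None and level >= required_level
    return ratio > ratio_cutoff, significant
```

For example, a ratio of 1.23 with a χ² of 2.62 passes the ratio criterion but reaches only the
75% level, whereas a ratio of 1.71 with a χ² of 10.27 passes both.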

As noted earlier, the validity of a metric should not be determined on the basis of one
data set from the literature. A single literature data set usually presents only a limited range
of structure/activity data and examines only a single biological activity. To obtain a proper
sense of the overall validity/quality of a metric, its behavior over many data sets representing
many different biological activities must be considered. It should be expected for randomly
selected data sets that, due to biological variability, an otherwise valid metric may appear
invalid for some particular set. An examination of the data in Table 1 confirms this
observation.

Except for data sets 6, 8, and 11, the ratios in Table 1 clearly confirm for the
topomeric CoMFA metric that the density of points in the LRT is greater than in the ULT, and
the χ² values confirm the significance of the plots. At the same time, the data for the two test
metrics clearly demonstrate with great sensitivity that this validation technique yields exactly
the results expected for a meaningless metric; specifically, a density ratio substantially equal
to 1 and no significance as determined by the χ² test. Contrary to accepted notions in the prior
art, with the discovery of this invention, random literature data sets can be used to validate
metrics. The type of publicly unavailable data set (as will be discussed in relation to the Abbott
data set below) where the bioactivity or inactivity for each molecule in the set has been
experimentally verified is not required.

Sets 6, 8, and 11 are the exceptions which help establish the rule. It is realistic to
expect that randomly selected data sets would include some where molecular edge (typically
a collision with receptor atoms) or other distorting effects would be present. For set 6, one
experimental value was so inconsistent with other reported values that the authors even called
attention to that fact. In addition to a problematic experimental value, all the structural changes
are rather small but some of the biological changes are fairly large. Something very unusual
is clearly happening with this system. For set 8, there is simply not enough data. Only 5
compounds (10 differences) were included and this proved insufficient to analyze even with
the sensitivity of the Patterson plot. For data set 11, there were two contributing factors. First,
the data set was small (only 7 compounds). Second, this set is a good example of an edge
effect where a methyl group protruding from the molecules interacts with the receptor site in
a unique manner which dramatically alters the activity.

Generally, the χ² values support the significance (or lack of significance) of the ratio
values. However, for data sets 9, 13, 14, and 15 the 95% confidence limit is not met. As with
all statistical tests, χ² is sensitive to the sample size of the population. For these data sets the
N was simply too low. This sensitivity is well demonstrated by the difference in χ² for sets
14 and 20. The ratio values of the two sets are virtually identical, but the χ²s differ
significantly since set 14 has few points and set 20 many points. Thus, χ² may be used to
confirm the significance of a ratio value, but, on the other hand, can not be used to discredit
a ratio value when too few data points are present. It can be clearly seen that the topomeric
CoMFA metric appears to define a useful dimensional space (measures chemistry space) better
for some of the target sets than for others.

As was discussed above, a metric need not be perfect to be valid. Even using an
imperfect metric significantly increases the probability that molecules can be properly
characterized based on structural differences. As the quality of the metric increases, the
probability increases. Thus, metrics which appear valid by the above analysis with respect to
only a few test data sets are still useful. Metrics, like topomeric CoMFA, which are valid for
85% (17/20) of the data sets yield a higher probability that structurally diverse molecules can
be identified.

Only with respect to data sets 6, 8, and 11 does the topomeric CoMFA metric not
appear to provide a useful measure. Considering the fact that some of the data sets have
limited samples and that a very wide range of biological interactions is represented, it is not
unexpected that random variations like this will appear. The critically important aspect of this
analysis is the fact that the metric is valid over a truly diverse range of types of
ligand-substrate interactions. This strongly confirms its general applicability as a valid measure of
the diversity of molecules which can be used to select optimally diverse molecules from large
data sets such as for use in combinatorial screening library design.

Another important aspect of the invention can be derived from these plots. Upon close
examination it can be seen that molecules having topomeric CoMFA differences (distances)
of less than approximately 80 - 100 generally have activities within 2 log units of each other.
This provides a quantitative definition of the radius of an area encompassing molecules
possessing similar characteristics (similarly diverse) in topomeric CoMFA metric space - the
neighborhood radius. Because the topomeric CoMFA metric is a valid molecular structural
descriptor, it is known that molecules with similar structure and activity will cluster in
topomeric CoMFA space. Topomeric CoMFA distances can, therefore, be used as a
diversity measure in selecting which molecules of a proposed combinatorial synthesis should
be retained in the combinatorial screening library in order to have a high probability that most
of the diversity available in that combinatorial synthesis is represented in the library. Thus,
for a combinatorial screening library, only one example of a molecular pair having a pairwise
distance from the other of less than approximately 80 - 100 kcal/mole (belonging to the same
diversity cluster) would be included. However, every molecule of a pair having a pairwise
distance greater than approximately 80 - 100 would be included. Of course, the "fineness" of
the resolution (the radius of the neighborhood in metric space) can be changed by using a
different activity difference. The Patterson plot permits by direct inspection the determination
of a neighborhood distance appropriate to any chosen biological activity difference. It is
suggested, however, that for a reasonable search of chemistry space for biologically significant
molecules, a difference of 2 log units is appropriate. The exact value chosen can be adjusted to
the circumstances. Clearly, the opportunity for real world perturbing effects to dominate the
measure is magnified by using less than 2 log units difference in biological activity. This is
another example of the general signal to noise ratio problem often encountered in
measurements of biological systems. For more accurate signal detection less perturbed by
unusual effects, the data sets would ideally contain biological activity values spread over a
wider range than what is usually encountered. The neighborhood radius predicted from an
analysis of the topomeric CoMFA metric can now be used to cluster molecules for use in
selecting those of similar structure and activity (such as is desired in designing a combinatorial
screening library of optimal diversity).

The teachings of this disclosure so far may be summarized as follows: 1) a
generalizable method for validating metric descriptors has been taught; 2) a specific descriptor,
topomeric CoMFA, has been described; and 3) the topomeric CoMFA descriptor has been
validated over a diverse sampling of different types of biological interactions from published
data sets.

The extraordinary power inherent in the validation method to quantitatively determine
a significant neighborhood radius is further demonstrated by a remarkable result obtained in
the analysis of a data set of potential reactants for a combinatorial synthesis (all 736
commercially available thiols) from the chemical literature. The results were obtained by
"complete linkage" hierarchical cluster analysis of the resulting steric field matrices, using
"CoMFA_STD" or "NONE" scaling. (CoMFA_STD implies block standardization of each
field, but without rescaling of the individual "columns" corresponding to particular lattice
points, which here produces the same clusters as no scaling.) For clustering, the "distance"
between any two molecules is calculated as the root sum of the squared differences in steric
field values over all of the lattice intersections defined by the CoMFA "region".
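
The distance and the stopping rule described above can be illustrated with a deliberately naive
sketch of complete linkage clustering. It is not the software used to obtain the reported
results: the function names are hypothetical, each molecule is assumed to be given as a plain
list of steric field values on a common lattice, no scaling is applied, and the cubic-time
search is chosen for readability rather than speed.

```python
import math


def field_distance(a, b):
    """Root sum of the squared differences between two field vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def complete_linkage(fields, stop_distance):
    """Agglomerative complete-linkage clustering of field vectors.

    Merging stops once the smallest distance between any two clusters
    exceeds stop_distance. Returns clusters as sorted lists of indices.
    """
    n = len(fields)
    dist = [[field_distance(fields[i], fields[j]) for j in range(n)]
            for i in range(n)]
    clusters = [[i] for i in range(n)]

    def linkage(c1, c2):
        # Complete linkage: the distance between the two farthest members.
        return max(dist[i][j] for i in c1 for j in c2)

    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > stop_distance:
            break
        _, a, b = best
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters
```

Calling `complete_linkage(fields, 91.0)` would, under these assumptions, stop merging at a
cluster separation of about 91, so that every remaining pair of clusters contains at least one
pair of members farther apart than that distance.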

In this example, cluster analysis using topomeric CoMFA fields produced a
classification of reagents that makes sense to an experienced medicinal chemist. For example,
when the topomerically aligned CoMFA fields of the 736 thiols are clustered, stopping when
the smallest distance between clusters is about 91 kcal/mole (within the "neighborhood"
distance of 80 - 100 found for these fields in the validation studies), 231 discrete clusters result
differing from each other in steric size by at least a -CH2- group. Upon inspection of the
clustering, an experienced analyst will immediately recognize that at this clustering level of
231, a natural break occurs, i.e., the separation between cluster level 231 and level 232 was
greater than any encountered between levels 158 and 682. Further inspection of these results
showed that, with perhaps ten exceptions, each cluster contained only compounds having a
very similar 2D topology or connectivity, while different clusters always contained compounds
having dissimilar 2D topology. Indeed, so logical was the grouping that it was possible to
provide a characteristic and distinctive systematic name for each of the 238 clusters using
mostly traditional or 2D chemical nomenclature as shown in Appendix "D". It is striking that
this entirely automatic clustering procedure, based only on differences among the topomeric
steric fields of 3D models of single conformers, generates a classification that coincides so
well with chemical experience as embodied in an independently generated 2D nomenclature.
From a pragmatic point of view, this result may also be said to validate the validation
procedure in the eyes of an experienced medicinal chemist who will tend to judge a metric by
whether its assessments of molecular similarity and diversity agree with his/her own
experience.

The critical aspect of this clustering result is that the structurally most logical clustering
was generated with a nearest neighbor separation of 91, in the middle of the 80 - 100
neighborhood distance determined from the validation procedure to be a good measure of
similarity among the molecules in topomeric CoMFA metric space. That is, the neighborhood
distance of approximately 80 - 100 (corresponding to an approximate 2 log biological
difference) predicted from the topomeric CoMFA validation, generates, when used in a
clustering analysis, logical systematic groupings of similar chemical structures. The exact size
of the neighborhood radius useful for clustering analysis will vary depending upon: 1) the log
range of activity which is to be included; and 2) the metric used since, in the real world,
different metrics yield different distance values for the same differences in biological activity.
As seen, the topomeric CoMFA metric can be used to distinguish diverse molecules from one
another - the very quantitative definition of diversity lacking in the prior art which is necessary
for the rational construction of an optimally diverse combinatorial screening library.

The discovered validation method of this invention is not limited to the topomeric
CoMFA field metric but is generalizable to any metric. Thus, once any metric is constructed,
its validity can be tested by applying the metric to appropriate literature data sets and
generating the corresponding Patterson plots. If the metric displays the neighborhood behavior
and is valid/useful according to the analysis of the Patterson plots set forth above, the
neighborhood radius is easily determined from the Patterson plots once an activity difference
is selected. This neighborhood radius can then be used to stop a clustering analysis when the
distance between clusters approaches the neighborhood radius. The resulting clusters are then
representative of different aspects of molecular diversity with respect to the clustered
property/metric. It should be noted that a metric, by definition, is only used to describe
something which has a difference on a measurement scale. This necessarily implies a
"distance" in some coordinate system. Mathematical transformations of the distances yielded
by any metric are still "distances" and can be used in the preparation of the Patterson plots.
For instance, the topomeric CoMFA field distances could be transformed into principal
component scores and would still represent the same measure.

Since the validity of the metric is not dependent on the particular chemical/biological
assays used to establish its validity, the metric can be applied to assemblies of chemical
compounds of unknown activity. Clustering of these assemblies using the validated
neighborhood radius for the metric will yield clusters of compounds representative of the
different aspects of molecular diversity found in the assemblies. (It should be understood that
active molecules for any given assay may or may not reside in more than one cluster, and the
cluster(s) containing the active compound(s) in one assay may not include the active
compound(s) in a different assay.)

As mentioned above, when designing an efficient combinatorial screening library, one
wishes to avoid including more than one molecule which is representative of the same
structural diversity. Therefore, if a single molecule is included from each cluster derived as
above, a true sample of the diversity represented by all the molecules is achieved without
overlap. This is what is meant by designing a combinatorial screening library for optimal
diversity. The methodologies of the present invention for the first time enable the achievement
of such a design.

5. Tanimoto Fingerprint Descriptor

There are other measures of molecular similarity which are not metrics, that is, they
do not correspond to a distance in some coordinate system but for which differences between
molecules can be calculated. One such measure is the Tanimoto fingerprint similarity
measure. This is one of the 2D measurements frequently used in the prior art to cluster
molecules or to partially construct other molecular descriptors. (Technically, descriptors
containing a Tanimoto term are not metrics since the Tanimoto is not a metric.) 2D fingerprint
measures were originally constructed to rapidly screen molecular data bases for molecules
having similar structural components. For the present purposes, a string of 988 bits has been
found convenient and sufficiently long. A Tanimoto 2D fingerprint similarity measure (Tanimoto
coefficient) between two molecules is defined as:

        No. Of Bits Occurring In Both Molecules
        ---------------------------------------
            No. Of Bits In Either Molecule

The Tanimoto fingerprint simply expresses the degree to which the substructures found in both
compounds are a large fraction of the total substructures.
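
The definition above translates directly into a few lines of code. The following is a minimal
sketch which assumes, for illustration only, that each fingerprint is held as a Python integer
whose set bits mark the substructures present; the function name and the convention for two
empty fingerprints are likewise assumptions.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints stored as Python integers."""
    both = bin(fp_a & fp_b).count("1")     # bits occurring in both molecules
    either = bin(fp_a | fp_b).count("1")   # bits occurring in either molecule
    # Assumption: two empty fingerprints are treated as identical.
    return both / either if either else 1.0
```

For two fingerprints `0b1110` and `0b0111`, two bits are shared out of four set in either, so
the coefficient is 0.5; identical fingerprints give 1.0 and fingerprints with no bits in common
give 0.0.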
A. Neighborhood Property
At an American Chemical Society meeting in April, 1995, Brown, Martin, and Bures
of Abbott Laboratories presented clustering data generated in an attempt to determine which,
if any, of the common descriptors available in the prior art produced "better clustering".
"Better clustering" was defined as a greater tendency for active molecules to be found in the
same cluster. One of the measures used was the Tanimoto 2D fingerprint coefficient calculated
from the structures of the entire molecules (not just the side chains). Proprietary and publicly
unavailable data sets were used by the Abbott group which covered a large number of
compounds for which the activity or lack of activity in four assays had been experimentally
verified over many years of pharmacological research. Although used as an analytical tool to
measure clustering effectiveness and not itself a focus of the presentation, one of the graphs
Martin presented plotted the "proportion of molecular pairs in which the second molecule is
also active" against the "pairwise Tanimoto similarity between active molecules and all
molecules" (hereafter referred to as a "sigmoid plot"). From the resulting graph Martin et al.
essentially found that if the Tanimoto coefficient of molecule A (an active molecule) with
respect to molecule B is greater than approximately 0.85, then there was a high probability that
molecule B will also be active; i.e., the activity of molecule B can be usefully predicted by the
activity of molecule A and vice versa. While not recognized or taught by the Abbott group at
the time, the present inventors recognized that, for a very restricted data set, the Abbott group
had data suggesting that the Tanimoto coefficient displayed a neighborhood property.
B. Applicability Of Tanimoto To Different Biological Systems

In order to determine whether the Tanimoto coefficient reflects a neighborhood property
over a range of different biological assays, 11,400 compounds from Index Chemicus containing
18 activity measures with 10 or more structures were analyzed. (Index Chemicus covers novel
compounds reported in the literature of 32 journals.) Lack of a reported activity was assumed
to be an inactivity although, in reality, the absence of a report of activity probably means that
the compound was just untested in that system. For comparison purposes, this assumption is
a more difficult test in which to discriminate a trend than with the Abbott data base where it
was experimentally known whether or not a molecule was active or inactive. However, all that
is absolutely needed for this analysis is a high likelihood of having compounds that are "similar
enough" in fingerprints to also be "similar enough" in biological activity. The converse,
"similar biological activity must have similar fingerprints", is patently untrue and is not tested.
Table 2 shows the structures and activities analyzed.

TABLE 2
Index Chemicus Activities

Set   No.     Biological             Set   No.     Biological
No.   Anal.   Activity               No.   Anal.   Activity

 1     30     Antianaphylactic       11     18     Cytotoxic
 2     12     Antiasthmatic          12    133     Enzyme Inhibiting
 3     71     Antibacterial          13    210     Nematocidal
 4     16     Anticholinergic        14     12     Opioid Rcptr. Bind
 5     55     Antifungal             15     39     Platelet Aggr. Inh.
 6     17     Anti-inflammatory      16     11     Radioprotective
 7     21     Antimicrobial          17     13     Renin Inhibiting
 8     13     B-adrenergic           18     11     Thrombin Inhib.
 9     21     Bronchodilator
10     34     Ca Antagonistic











To convert this data to sigmoid plots, the data lists were examined for everything which
was active, and a Tanimoto coefficient calculated (on the whole molecule) between every
active molecule and everything else in the list. For plotting, the value of the number of
molecules which were a given value (X) away from an active compound was determined. The
proportion (frequency of such molecules) was plotted on the vertical axis and the Tanimoto
coefficient on the horizontal axis. The bin widths for the X axis are 0.05 Tanimoto difference
units wide, and the activity from Index Chemicus was simply "active" or "inactive". Figures
9A and 9B show the resulting plots for 16 of the 18 data sets broken down into sets of 8
(replication of these Figures in the priority applications did not pick up the ninth curve in each
Figure, so that the ninth curve in each set has been omitted from this application). Many of
the curves have a sigmoid shape, but the inflection points clearly differ. Also, it is not clear
what effect excluding the differences between active and inactive molecules has on the shape
of the curves. To get an overall view, Figure 9C shows the cumulative plot for both series of
9 activities. This plot generally indicates that, given an active molecule, the probability of an
additional molecule, which falls within a Tanimoto similarity of 0.85 of the active, also being
active is, itself, approximately 0.85. Stated slightly differently, when a Tanimoto similarity
descriptor is summed over an arbitrary assortment of molecules and biological activities, it is
clear that molecules having a Tanimoto similarity of approximately 0.85 are likely to share the
same activity. Thus, the Tanimoto similarity displays a neighborhood behavior (neighborhood
distance of approximately 0.15) when applied to a large enough number of arbitrary sets of
compounds. As will be discussed later, one of the more powerful aspects of the Patterson plot
validation method is that it can provide a relative ranking of metrics and distinguish on what
type of data sets each may be more useful. In this regard, it will be seen that the whole
molecule Tanimoto coefficient as a diversity descriptor has unanticipated and previously
unknown drawbacks.
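
The binning that underlies a sigmoid plot can be sketched as follows. This is one reading of
the procedure described above, offered only as an illustration: for every active molecule, each
other molecule is placed in a similarity bin 0.05 wide, and the plotted value for a bin is taken
to be the fraction of those "second" molecules that are themselves active. The function name,
the callable used to supply similarities, and the use of bin indices as dictionary keys are all
assumptions.

```python
def sigmoid_curve(similarity, is_active, bin_width=0.05):
    """Proportion of second molecules that are active, per similarity bin.

    similarity -- function (i, j) -> Tanimoto coefficient of molecules i and j
    is_active  -- list of booleans, one per molecule
    Returns {bin index: proportion} for every non-empty bin; the lower edge
    of a bin is its index multiplied by bin_width.
    """
    n_bins = round(1 / bin_width)
    total = [0] * n_bins
    active = [0] * n_bins
    for i, first_active in enumerate(is_active):
        if not first_active:
            continue  # only pairs whose first molecule is active are counted
        for j, second_active in enumerate(is_active):
            if i == j:
                continue
            # The small offset guards against floating-point error at bin edges;
            # the clamp puts a similarity of exactly 1.0 in the last bin.
            b = min(int(similarity(i, j) / bin_width + 1e-9), n_bins - 1)
            total[b] += 1
            active[b] += second_active
    return {b: active[b] / total[b] for b in range(n_bins) if total[b]}
```

Plotting the returned proportions against the bin edges gives one curve per activity; a
neighborhood property would show up as proportions that rise toward the high-similarity bins.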

However, one of the principal features of the present invention, neither taught by the
Abbott researchers nor recognized by anyone in the prior art, is that the Tanimoto descriptor
can be used in a unique manner in the construction of a combinatorial screening library. In
fact, as will be seen, it has been discovered that this descriptor can be used to provide an
important end-point determination for the construction and merging of such libraries and, in
addition, is a useful descriptor for constructing and searching the virtual library.

C. Comparison of Sigmoid and Patterson Plots 
It is important to understand the difference in the types of information about descriptors
and the neighborhood property which is yielded by the Abbott sigmoid plot and the generalized
validation method and Patterson plot of the present invention.

To make a sigmoid plot, the molecules must first be divided into two categories,
active molecules and inactive molecules, based on a cut off value chosen for the biological
activity. One molecule of a pair must be active (as defined by the cut off value) before the
pair is included in the sigmoid plot. Pairs in which neither molecule has any activity, as well
as those pairs in which neither molecule has an activity greater than the cut off value, do not
contribute information to the sigmoid plot. Thus, the sigmoid plot does not use all of the
information about the chemical data set under study. In fact, it uses a limited subset of data
derivable from the more general Patterson plot described above. As a consequence very large
sets of data (or sets for which both the activity and inactivity in an assay are experimentally
known) are needed to get statistically significant results from the sigmoid plots.

By comparison, the Patterson plot clearly displays a great deal more information
inherent in the data set which is relevant to evaluating the metric. Most importantly, the
validity and usefulness of the metric can be quickly established by examining the Patterson
plots resulting from application of the metric to random data sets. As will be shown in the next
section, a metric may reflect a neighborhood property (such as in a sigmoid plot), but at the
same time may not be a particularly valid/useful metric or may have limited utility. In
Patterson plot analysis, all pairs of molecules and their associated activities or inactivities
contribute to the validity analysis and to the determinations of the neighborhood radius. Thus,
in a Patterson plot, it is easy to see what percentage of the total data set is included when the
neighborhood definition is changed by choosing a different biological difference range. This
has important consequences for choosing the correct neighborhood radius for clustering.

To better see the relationship between the information available from each type of plot,
Figure 10A shows a Patterson plot for the Cristalli data set reconstructed under the Abbott
sigmoid plot simplification that the 32 molecules were either "active" (activity = 1) or
"inactive" (activity = 0). The cut off value for biological activity was chosen to be 60 µM.
Thus, "active" molecules were those with an A1 agonist potency of 60 µM or less, and
"inactive" molecules were those with a potency greater than 60 µM. With this Abbott
simplification, only two differences in bioactivities can occur for a pair of molecules: both
active or inactive, difference = 0; or one active and the other inactive, difference = 1. The
result of constructing a Patterson plot for this impoverished data set thus must appear as two
parallel lines, as shown in Figure 10A alongside the Patterson plot for the full Cristalli data
set in Figure 10B. Although a triangle and trapezoid should still be anticipated within such a
reduced plot, the active/inactive classification so limits the observable biological differences
that no pattern whatsoever is apparent. The very limited nature of the information retained is
clearly seen. In particular, by only looking at molecular pairs in which one molecule is active
above a predetermined cut off value, the sigmoid plot totally fails to take into account all the
information about the behavior of the metric with respect to non-active pairs (in which one or
both molecules have activities less than the cut off value) contained in the distribution of points
in the Patterson plot. As a major consequence, the Patterson plot is: 1) able to derive
information from much less data; and 2) much more sensitive to all the nuances contained in
the data.
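
The loss of information caused by the active/inactive simplification can be made concrete with
a short sketch. It assumes, purely for illustration, that potencies are given as positive
numbers in a common unit with smaller values meaning greater potency, and that the full
Patterson plot uses differences in log10 potency; the function name and the example values are
hypothetical and are not taken from the Cristalli data set.

```python
import math
from itertools import combinations


def activity_differences(potencies, cutoff=None):
    """Pairwise activity differences for the vertical axis of a Patterson plot.

    potencies -- potency of each molecule (smaller means more potent)
    cutoff    -- if given, each molecule is first reduced to active (1) when
                 its potency is at or below the cutoff, else inactive (0)
    """
    if cutoff is None:
        values = [math.log10(p) for p in potencies]
    else:
        values = [1 if p <= cutoff else 0 for p in potencies]
    return [abs(a - b) for a, b in combinations(values, 2)]
```

For four hypothetical potencies of 0.5, 5, 50 and 500, the full differences span 1, 2 and 3 log
units, whereas with a cut off of 60 the only differences that remain are 0 and 1, which is why
the reduced plot collapses onto two parallel lines.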

6. Comparison of Tanimoto and Topomeric CoMFA Metrics

Having recognized that both the topomeric CoMFA and Tanimoto coefficient metrics
display the neighborhood property, a comparison (between Table 1 and columns 3 and 4 of
Table 3) of the application of the two metrics to identical data sets yields interesting insights
into their respective sensitivities. The prior art practice of using the value of (1 - Tanimoto
coefficient) as a distance was followed when performing the analysis. For columns 3 and 4 of
Table 3, Patterson plots were constructed using the Tanimoto distances of the whole molecules
represented in the 20 data sets which had been used for the topomeric CoMFA analysis.
Patterson plots were also constructed using the Tanimoto distances of just the side chains (as
was done with the topomeric CoMFA metric) of the molecules for the same 20 data sets.
Table 3 shows the Tanimoto fingerprint density ratios for the whole molecule and side
chain Tanimoto metrics and the corresponding χ² values for the 20 data sets.

TABLE 3

Patterson Plot Ratios and Associated χ²

                        Col. 1        Col. 2        Col. 3        Col. 4
                        Side Chain    Side Chain    Whole         Whole
                        Tanimoto      Tanimoto      Molecule      Molecule
                        Fingerprint   Fingerprint   Tanimoto      Tanimoto
                                                    Fingerprint   Fingerprint
No.  Reference          Ratio         χ²            Ratio         χ²

 1   Uehling            1.89           14.22        1.55            6.22
 2   Strupczewski       1.70          143.48        1.41           59.61
 3   Siddiqi            1.04            0.08        1.04            0.07
 4   Garratt-1          1.60            8.10        1.07            0.19
 5   Garratt-2          1.89           36.05        1.08            0.50
 6   Heyl               1.71           13.83        1.01            0.00
 7   Cristalli          1.75          144.54        1.31           30.27
 8   Stevenson          0.94            0.05        1.07            0.04
 9   Doherty            1.73            4.03        1.05            0.04
10   Penning            1.97           37.03        1.53           12.73
11   Lewis              1.64            4.80        1.01            0.00
12   Krystek            1.01            0.04        1.23           16.31
13   Yokoyama-1         1.48            9.94        1.01            0.00
14   Yokoyama-2         1.37           18.94        1.70           16.03
15   Svensson           1.64           16.61        1.02            0.02
16   Tsutsumi           1.74           21.56        1.58           14.35
17   Chang              1.34          145.00        1.13            8.36
18   Rosowsky           1.04            0.06        1.01            0.00
19   Thompson           1.72            7.83        1.17            0.68
20   Depreux            1.60           64.22        1.18            6.73

     MEAN               1.54           34.62        1.21            8.61
     STANDARD
     DEVIATION          0.32           49.85        0.23           14.57



Surprisingly, the whole molecule Tanimoto appears to be a good descriptor for only
50% of the data sets (10/20 data sets with a ratio greater than 1.1). At first glance this is
surprising in light of the original Abbott data, but, on second consideration, it is consistent
with the observed significant individual variability of the plots obtained from the Index
Chemicus analysis in Figures 9A and 9B. The Patterson plots confirm that the Tanimoto
coefficient does display a neighborhood property for some data sets, but clearly it is less
valid/useful for other sets. It is also not as consistent as the topomeric CoMFA or the side
chain Tanimoto descriptor, which were valid 85% (17/20) and 80% (16/20) of the time
respectively. Upon inspection of the whole molecule Tanimoto data, it can be seen that the 10
data sets which do not have ratios greater than 1.1 all have a small Tanimoto range and/or
contain relatively few compounds. The χ² values for these data sets also confirm the lack of
statistical significance. Essentially, the whole molecule Tanimoto is a less discriminating
diversity measurement than the others and would appear to need, at the very least, more data
and/or a greater range of values. The method of this invention clearly provides much more
information and insight into the validation of the Tanimoto metric than did the Abbott style
sigmoid plot.
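
The counts quoted for the two Tanimoto variants can be checked directly against the ratio columns of Table 3. The following is a minimal sketch: the 1.1 threshold is taken from the text, the function and variable names are illustrative, and the side chain entry for data set 12 is taken as 1.01, the value consistent with the tabulated mean of 1.54.

```python
# Patterson plot ratios transcribed from Table 3, data sets 1 through 20.
side_chain = [1.89, 1.70, 1.04, 1.60, 1.89, 1.71, 1.75, 0.94, 1.73, 1.97,
              1.64, 1.01, 1.48, 1.37, 1.64, 1.74, 1.34, 1.04, 1.72, 1.60]
whole_molecule = [1.55, 1.41, 1.04, 1.07, 1.08, 1.01, 1.31, 1.07, 1.05, 1.53,
                  1.01, 1.23, 1.01, 1.70, 1.02, 1.58, 1.13, 1.01, 1.17, 1.18]

THRESHOLD = 1.1  # ratio above which a metric is counted as valid for a data set

def valid_sets(ratios, threshold=THRESHOLD):
    """Return the 1-based data set numbers whose ratio exceeds the threshold."""
    return [i for i, r in enumerate(ratios, start=1) if r > threshold]

def failing_sets(ratios, threshold=THRESHOLD):
    """Return the 1-based data set numbers whose ratio does not exceed the threshold."""
    return [i for i, r in enumerate(ratios, start=1) if r <= threshold]

print(len(valid_sets(side_chain)))      # 16
print(failing_sets(side_chain))         # [3, 8, 12, 18]
print(len(valid_sets(whole_molecule)))  # 10
```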

For the majority of sets, 80% (16/20), the side chain Tanimoto metric also appears to
be valid/useful. This is an extraordinarily surprising result since this metric has always been
thought of in the prior art as useful only as a measure of whole molecule similarity. Overall,
it compares favorably with topomeric CoMFA. A very interesting aspect, however, is that the
sets for which validity is not apparent are not identical for the topomeric CoMFA and side
chain Tanimoto metrics. The side chain Tanimoto metric does not appear valid with respect
to sets 3, 8, 12, and 18. Clearly set 8 had too little data for either the topomeric CoMFA or
the side chain Tanimoto descriptors. The most interesting comparison involves sets 3, 12, and
18, which validated the topomeric CoMFA metric but for which the side chain Tanimoto metric
appears invalid. Upon inspection, these sets all contained substituents in which only the
position of a particular side chain varied. Since the topomeric CoMFA metric is sensitive to
the relative spatial orientations of the side chains, while the Tanimoto metric is only sensitive
to the presence or absence of the side chains, the sterically driven topomeric CoMFA metric
was sensitive to the differences in these sets while the Tanimoto was insensitive. In certain
circumstances the Tanimoto may be a useful descriptor of molecular diversity for use on the
reactants in a combinatorial synthesis, a result totally at odds with the wisdom of the prior art.
Clearly, however, the differences in sensitivities between the metrics should be considered
when applying them.

Further, considering the five metrics already discussed above (topomeric CoMFA,
whole molecule Tanimoto, side chain Tanimoto, random numbers, and force field energy), it
is clear that the validation method of this invention can be used to rank the relative quality
(validity/usefulness) of the metrics. In addition, when enough metrics have been examined by
the method of this invention, it will be possible to choose metrics appropriate to the type of
molecular structural differences which it is desired to analyze. Correspondingly, when a
metric which has been validated over a very wide range of data sets and biological activities
yields surprising results (appears invalid) when applied to a new data set, one potential
interpretation may be that the data are in error. This highlights another feature of the
invention, the ability to reliably suggest that some experimental observations are generating
unusual data. Instead of using a data set to validate a metric, the previously validated metric
is used to examine the reliability of the data set. By constructing Patterson plots and checking
the associated χ² value for significance, experimental scientists have another tool with which
they may independently assess their data, especially in situations where new biological
activities are being investigated.

7. Additional Validation Results

Considering that the validation method of this invention has shown that both the
topomeric CoMFA metric and the Tanimoto metric define metric spaces where biological
properties cluster (that is, the metrics are sensitive to biologically relevant molecular structural
differences), a descriptor combining the two metrics was constructed. A combined descriptor
has been identified which is the best diversity descriptor discovered to date. This descriptor
has been validated and has been found to be far superior to any previously considered metric
in its ability to identify a neighborhood of similarity for design purposes. This descriptor, a
weighted combination of the topomeric CoMFA descriptor and the Tanimoto descriptor,
defines a distance measure as:

^(1 -TanimotofHOSiOixtopomericCoMFA)^ 
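
A weighted combination of this kind can be sketched as follows. This is a minimal sketch only: it assumes the two terms are combined as a root sum of squares, and the weight `w` applied to the topomeric CoMFA difference is a hypothetical parameter whose value is left to the caller. Fingerprints are modeled as Python sets of "on" bit positions, and all function names are illustrative.

```python
import math

def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(fp_a | fp_b)
    if union == 0:
        return 1.0  # two empty fingerprints are treated as identical
    return len(fp_a & fp_b) / union

def tanimoto_distance(fp_a, fp_b):
    """Prior art distance: 1 - Tanimoto coefficient."""
    return 1.0 - tanimoto(fp_a, fp_b)

def combined_distance(fp_a, fp_b, comfa_difference, w):
    """Assumed form: sqrt((1 - Tanimoto)^2 + (w * topomeric CoMFA difference)^2)."""
    return math.hypot(tanimoto_distance(fp_a, fp_b), w * comfa_difference)

# Hypothetical fingerprints, CoMFA difference, and weight, for illustration only.
a = {1, 4, 7, 9}
b = {1, 4, 8, 9, 12}
print(tanimoto(a, b))  # 3 shared bits / 6 bits in the union = 0.5
print(combined_distance(a, b, comfa_difference=100.0, w=0.005))
```

The weight serves to bring the two terms onto comparable scales, since the Tanimoto distance is bounded by 1 while topomeric CoMFA differences are of the order of tens to hundreds of units.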

This descriptor has a ratio greater than 1.1 in all 20 of the 20 test data sets and, in fact,
averages a ratio of 1.55. In all 20 data sets, for a neighborhood distance of 0.240
(corresponding to a biological activity difference of 2 log units), not one single point was found
above the line in the Patterson plot. Although this may appear to be a "perfect" metric, it is
doubted that this level will be maintained as more and more data sets are added to the
validation group. However, it is believed that it will continue to be the strongest of the
presently known descriptors. At the present time, the results of performing validation studies
on the combined descriptor and other possible metrics using the Patterson plot method of this
invention and the 20 described data sets result in the following data:



TABLE 4
Patterson Plot Ratios

No.  Reference       HB      LOGP    MR      AP      CONN    AUTO    COMBO

 1   Uehling         [?]     [?]     [?]     [?]     [?]     1.66    1.87
 2   Strupczewski    [?]     1.00    [?]     [?]     [?]     1.20    1.47
 3   Siddiqi         [?]     0.97    [?]     1.00    1.07    1.00    1.48
 4   Garratt-1       a       [?]     1.01    [?]     1.11    1.14    1.68
 5   Garratt-2       a       1.01    [?]     0.97    1.09    1.09    1.50
 6   Heyl            [?]     [?]     [?]     [?]     b       1.01    1.34
 7   Cristalli       [?]     [?]     [?]     1.27    [?]     1.17    1.44
 8   Stevenson       a       [?]     [?]     [?]     [?]     [?]     [?]
 9   Doherty         1.07    1.00    [?]     1.10    [?]     [?]     1.76
10   Penning         1.72    1.00    0.97    [?]     1.00    [?]     [?]
11   Lewis           *0.57   1.00    1.02    0.97    [?]     [?]     [?]
12   Krystek         1.69    0.85    0.85    1.43    1.01    [?]     [?]
13   Yokoyama-1      *0.71   d       1.01    1.25    1.01    0.99    1.52
14   Yokoyama-2      [?]     1.00    [?]     1.25    1.05    0.99    1.57
15   Svensson        *0.31   1.01    0.99    1.31    1.08    1.00    1.39
16   Tsutsumi        1.67    1.04    0.95    1.18    1.00    0.95    1.52
17   Chang           1.35    1.00    1.00    1.00    c       1.20    1.36
18   Rosowsky        1.44    1.03    0.96    1.23    1.08    1.21    1.66
19   Thompson        a       1.12    0.99    0.87    1.02    1.01    1.47
20   Depreux         *0.44   1.02    0.99    0.99    1.01    0.98    1.26

     MEAN            *1.43   1.01    0.98    1.15    1.05    1.12    1.55
     STANDARD
     DEVIATION       *0.27   0.05    0.05    0.19    0.06    0.17    0.16

HB    = Topomeric Hydrogen Bonding        AP    = Atom Pairs
LOGP  = Calculated Log P                  AUTO  = Autocorrelation
MR    = Molar Refractivity                CONN  = Connectivity Indices
COMBO = Combined Topomeric CoMFA & Tanimoto

*  Asterisked values are excluded in computing the mean. These values are all artifacts, the
   result of there being no more than two distinguishable values of the molecular descriptor within
   the particular series; hence only two possible values of the x variable in a Patterson plot.
a  No hydrogen bonding groups exist to define the metric under HB
b  Too many groups for s/w to handle under CONN
c  One hexavalent atom confuses the computation under CONN
d  A LOGP could not be calculated for the molecules in this data set

Combining the data from Table 4 with the data from Tables 1 and 3 permits the relative
ranking of some known metrics:

VALIDITY/USEFULNESS RANK:                               No. Of Ratios > 1.1

USEFUL:
    Combined Topomeric Steric CoMFA and Tanimoto            20/20
    Topomeric Steric CoMFA                                  17/20
    Tanimoto 2D Fingerprints (Side Chain)                   16/20
    Topomeric HBond Spatial Fingerprints                    10/12

LESS USEFUL:
    Tanimoto 2D Fingerprints (Whole Molecule)               10/20
    Atom Pairs (R. Sheridan)                                11/20
    Autocorrelation                                          9/20

NOT USEFUL - INVALID:
    Connectivity Indices                                     3/18
        (Health Design Implementation, first 10)
    Partition Coefficient (CLOGP)                            1/19
    Molar Refractivity (CMR)                                 0/20
    Force Field Strain Energy                                0/18
    Random Numbers                                           0/20

Note: A denominator of less than 20 indicates that the metric could not be calculated
for all 20 data sets.
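
The counts in the ranking can be turned into fractions so that metrics with different denominators are compared on the same footing. The following is a minimal sketch using the counts listed above; the tuple layout and the function name `valid_fraction` are illustrative choices.

```python
from fractions import Fraction

# (metric, data sets with ratio > 1.1, data sets for which the metric could be calculated)
results = [
    ("Combined Topomeric Steric CoMFA and Tanimoto", 20, 20),
    ("Topomeric Steric CoMFA", 17, 20),
    ("Tanimoto 2D Fingerprints (Side Chain)", 16, 20),
    ("Topomeric HBond Spatial Fingerprints", 10, 12),
    ("Tanimoto 2D Fingerprints (Whole Molecule)", 10, 20),
    ("Atom Pairs", 11, 20),
    ("Autocorrelation", 9, 20),
    ("Connectivity Indices", 3, 18),
    ("Partition Coefficient (CLOGP)", 1, 19),
    ("Molar Refractivity (CMR)", 0, 20),
    ("Force Field Strain Energy", 0, 18),
    ("Random Numbers", 0, 20),
]

def valid_fraction(entry):
    """Exact fraction of calculable data sets for which the metric was valid."""
    _, valid, total = entry
    return Fraction(valid, total)

# sorted() is stable, so metrics with equal fractions keep their listed order.
ranked = sorted(results, key=valid_fraction, reverse=True)
for entry in ranked:
    name, valid, total = entry
    print(f"{name:45s} {valid:2d}/{total:<2d}  {float(valid_fraction(entry)):.2f}")
```

Sorting strictly by fraction moves a few entries relative to the listing above (for example 10/12 ranks ahead of 16/20), which is a reminder that the listing groups metrics into broad classes rather than ordering them by fraction alone.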
8. Combinatorial Library Design Utilizing Validated Metrics

The starting point for the design of any combinatorial screening library is the choice
of synthetic reaction scheme involving the selection of the core molecule and the possible
reactants which could be used with any specific chemistry. As mentioned earlier, well known
and understood organic reactions are generally utilized. Initially, information about the
chemical structure of all the reactants (and cores, when appropriate) and the synthetic
chemistry involved (what products can be built) is input as a database in the computer in a
form recognizable by the computational software. Using the insights gained from the discovery
of the validation method of this invention, it is now possible to design general purpose
combinatorial screening libraries of optimal diversity.

Conceptually, the design process may be thought of as a filtering process in which the
molecules available in a combinatorially accessible chemical universe are run through
consecutive filters which remove different subsets of the universe according to specified
criteria. The goal is to filter out (reduce the numbers of) as many compounds as possible while
still retaining those compounds which are necessary to completely sample the molecular
diversity of the combinatorially accessible universe. The basic design method of this invention
along with several ancillary considerations is shown schematically in Figure 11 using the filter
analogy. For this example only two sets of reactants are considered, with one reactant of each
set being contributed to each final product molecule. The reactants are shown forming the top
row and first column of a combinatorial matrix A. Only a portion of the possible combinatorial
matrix is shown, the remainder being indicated by the sections connected to the matrix by dots.
One set of reactants is represented by circles 1, and the other set by squares 2. Each empty
matrix location represents one possible combinatorial product which can be formed from the
two sets of reactants. (The matrix of possible products would be a rectangular prism for three
sets of reactants, and a multidimensional prism for higher orders of reactant sets.) As the
design process is implemented, the number of products to be included in the screening library
design is reduced by each filter 4. Beside each filter step is indicated the corresponding text
section describing that filter. Also set out opposite each filtering step is an indication of the
software and its source required to implement that step.
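
The combinatorial matrix and the effect of consecutive filters on it can be sketched as follows. This is a minimal sketch under stated assumptions: the reactant labels and the two filter predicates are hypothetical stand-ins for the criteria described in the following sections, and the function names are illustrative.

```python
from itertools import product

def combinatorial_products(*reactant_sets):
    """Every product formed by taking one reactant from each reactant set."""
    return list(product(*reactant_sets))

def apply_filters(reactants, filters):
    """Run reactants through consecutive filters; each filter keeps those passing its test."""
    for keep in filters:
        reactants = [r for r in reactants if keep(r)]
    return reactants

# Hypothetical reactant labels standing in for the circles and squares of the matrix.
circles = ["c1", "c2", "c3", "c4", "c5"]
squares = ["s1", "s2", "s3", "s4"]

print(len(combinatorial_products(circles, squares)))  # 5 * 4 = 20

# Hypothetical filters: each one removes a single reactant.
circle_filters = [lambda r: r != "c2", lambda r: r != "c5"]
kept_circles = apply_filters(circles, circle_filters)

print(len(combinatorial_products(kept_circles, squares)))  # 3 * 4 = 12
```

Removing two of the five row reactants removes two whole rows of the matrix, so the number of candidate products falls from 20 to 12 before any product has to be examined individually.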

A. Removal Of Reactants For Non-Diversity Reasons

In designing screening libraries derived from combinatorially accessible chemical
universes, practical and end use considerations as well as diversity concerns can be used to
reduce the number of reactants which will be used to combinatorially specify the product
molecules. These practical and end-use criteria can be divided into those of general
applicability and those of more specific applicability for a particular type of screening library
(such as for drug discovery). The following discussion is not meant to be limiting, but rather
is intended to suggest the types of selections which may be made.
i. General Removal Criteria

As a first consideration, reactants with unusual elements (such as the metals) are
normally excluded when considering the synthesis of organic molecules. In addition,
tautomerization of structures can cause problems when searching a universe of reactants
database, either by missing structures that are actually present or by finding a specific functional
group which is really not there. The most common example of this is keto-enol
tautomerism. Thus, possible tautomeric reactants must be examined and improper forms
eliminated from consideration. Generally, reactants may be provided in solvent, as salts with
counter-ions, or in hydrated forms. Before their structures can be analyzed for diversity
purposes, the salt counter-ions, solvent, and/or other species (such as water) should be
removed from the molecular structure to be used.
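
Stripping counter-ions, solvent, and water can be sketched very simply when structures are held as strings with one component per fragment. The sketch below assumes, purely for illustration, SMILES-style strings in which components are separated by a period; it keeps the component with the most letters, which is only a crude proxy for the number of atoms. The function name `largest_fragment` and the example strings are hypothetical.

```python
def largest_fragment(structure):
    """Keep the period-separated component with the most letters (crude size proxy)."""
    fragments = structure.split(".")
    return max(fragments, key=lambda frag: sum(ch.isalpha() for ch in frag))

# Hypothetical examples: a sodium salt, a hydrochloride, and a hydrate.
print(largest_fragment("CC(=O)[O-].[Na+]"))  # CC(=O)[O-]
print(largest_fragment("NCCc1ccccc1.Cl"))    # NCCc1ccccc1
print(largest_fragment("OC(=O)CCN.O"))       # OC(=O)CCN
```

A production implementation would count atoms properly and neutralize charges; the point here is only that the species to be analyzed for diversity is the reactant itself, without its accompanying salt or solvent.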

Additionally, reactants may contain chemical groups which would interfere with or
prevent the synthetic reaction in which it is desired to use them. Clearly, either different
reaction conditions must be used or these reactants removed from consideration. Sometimes,
while the synthesis may be possible, extraction of the products resulting from some reactants
may be difficult using the proposed synthetic conditions. Again, if possible, another synthetic
scheme must be used or the reactants removed from consideration. Price and availability are
not insignificant considerations in the real world. Some reactants may need to be specially
synthesized for the combinatorial synthesis or are otherwise very expensive. In the prior art,
expensive reactants would typically be eliminated before proceeding further with the library
design unless they were felt to be particularly advantageous. One of the advantages of the
method of this invention is that the decision whether to include expensive reactants may be
postponed until the molecular structures have been analyzed by a validated descriptor. With
confidence that the validated descriptor permits clustering of molecules representing similar
diversity, often another, less expensive, reactant can be selected to represent the diversity
cluster which also includes the expensive molecule. The specifics of any particular
contemplated combinatorial synthesis may suggest additional appropriate filtering criteria at
this level. In Figure 11 the effect on the number of possible products of removing only a few
reactants is easily seen in matrix B. For each reactant removed, whole rows and columns of
possible products are excluded.

ii. Biologically Based Criteria
A library designed for screening potential pharmacological agents imposes its own
limitations on the type and size of molecules. For instance, for drug discovery, toxic or
metabolically hazardous reactants or those containing heavy metals (organometallics) would
usually be excluded at this stage. In addition, the likely bioavailability of any synthetic
compound would be a reasonable selection criterion. Thus, the size of the reactants needs to be
considered since it is well known that molecules above a given range of molecular weights
generally are not easily absorbed. Accordingly, the molecular weight for each reactant is
calculated. Since the final molecular weight for a bioavailable drug typically ranges from 100
to 750 and since, by definition, at least two reactants are used in a combinatorial synthesis,
reactants having a size over some set value are excluded. Typically, those above 600 are
excluded at this stage at the present time. A lower value could be used, but it is felt that there
is no reason to restrict the diversity unduly at this stage in the design process. Once again, of
course, this value can be adjusted depending on the chemistry involved.
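
The size cut amounts to a simple threshold test on precomputed molecular weights. The following is a minimal sketch: the reactant names and weights are hypothetical, the function name `filter_by_molecular_weight` is illustrative, and the default of 600 is the value stated in the text.

```python
MAX_REACTANT_MW = 600.0  # threshold stated in the text; adjustable for the chemistry involved

def filter_by_molecular_weight(reactants, max_mw=MAX_REACTANT_MW):
    """Keep reactants whose molecular weight does not exceed max_mw."""
    return {name: mw for name, mw in reactants.items() if mw <= max_mw}

# Hypothetical reactants with precomputed molecular weights.
reactants = {
    "reactant_A": 93.1,
    "reactant_B": 412.5,
    "reactant_C": 600.0,
    "reactant_D": 712.4,
}

print(sorted(filter_by_molecular_weight(reactants)))
# ['reactant_A', 'reactant_B', 'reactant_C']
print(sorted(filter_by_molecular_weight(reactants, max_mw=400.0)))
# ['reactant_A']
```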

Another aspect of bioavailability is the diffusion rate of a compound across membranes
such as the intestinal wall. Reactants not likely to cross membranes (as determined by a
calculated LogP or other measure) would usually be eliminated. At the present time, although
the CLOGP for reactants makes only a partial contribution to the product CLOGP, it is
believed that if any reactant has a CLOGP greater than 10, it will not make a usable product.
Accordingly, the CLOGP is calculated for each reactant and only those with CLOGP ≤ 10
are kept. Again, in any particular case, a different value of CLOGP could be utilized. For
those reactants for which it is difficult or impossible to calculate a LOGP, it is assumed the
CLOGP would be less than 10 so that the reactants are kept in the library design at this point.
As will be discussed later, a CLOGP will also be calculated on the products.
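
The CLOGP rule, including the treatment of reactants for which no value can be calculated, can be sketched as a single predicate. This is a minimal sketch: a missing value is represented by `None`, the reactant names and values are hypothetical, and the function name `keep_reactant` is illustrative.

```python
from typing import Optional

MAX_CLOGP = 10.0  # threshold stated in the text; a different value could be used

def keep_reactant(clogp: Optional[float], max_clogp: float = MAX_CLOGP) -> bool:
    """Keep a reactant when its CLOGP is at most max_clogp, or when no CLOGP
    could be calculated (None), in which case it is assumed to be acceptable."""
    return clogp is None or clogp <= max_clogp

# Hypothetical reactants; None marks a CLOGP that could not be calculated.
clogp_values = {"reactant_A": 2.3, "reactant_B": 11.8, "reactant_C": None, "reactant_D": 10.0}

kept = [name for name, value in clogp_values.items() if keep_reactant(value)]
print(kept)  # ['reactant_A', 'reactant_C', 'reactant_D']
```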

Other reactants are considered undesirable due to the presence of structural groups not
considered "bio-relevant". Bio-relevance is judged by comparison with known drugs and by
the experience of medicinal chemists involved in the design of the library. It is hoped that a
future formal analysis of drug databases will yield further information about which groups
should be excluded. Exclusion on this basis should be minimized since one of the goals of the
combinatorial library design process is to find biologically active molecules through the
exploration of combinatorial chemistry space which might not otherwise be found. Other
removal criteria may be based on whether possible reactants involve sugars or have multiple
functionalities. At the present time, the compounds shown in Table 5 are believed to be
undesirable and are generally excluded at the initial stage of library design.

TABLE 5
Biologically Non-Relevant Groups

GROUP DEFINITION              SYBYL Line Notation (SLN)                          Reason(s) For Exclusion

BOC                           C(OC(=O)N)(CH3)(CH3)CH3                            Stability
FMOC                          C[1]H:C[2]:C(:CH:CH:CH:@1)CH(CH2OC(=O)N)\          Stability
                              C[22]:C@2:CH:CH:CH:CH:@22
Hydrolyzable acyclic groups   Lvg-[!r]C(=Any)-[!r]Lvg{Lvg:O|N|Br|Cl|I}           Stability
Silicon, Aluminium, Calcium   Si, Al, Ca                                         Unfashionable
Polyhydroxyls/sugars          HOCC(OH)COH                                        Extraction difficulties
Allyl halides                 HaloC(Any)C=:Any{Halo:Br|Cl|I}                     Stability, alkylating agent
Benzyl halides                HaloC(Any)C=:Any{Halo:Br|Cl|I}                     Stability, alkylating agent
Phenacyl halides              HaloC(Any)C=:Any{Halo:Br|Cl|I}                     Stability, alkylating agent
Alpha-halo carbonyls          HaloC(Any)C=:Any{Halo:Br|Cl|I}                     Stability, alkylating agent
Acyl halides                  Csp(=O)Hal{Csp:C|S|P}                              Stability, alkylating agent
Phosphyl halides              Csp(=O)Hal{Csp:C|S|P}                              Stability, alkylating agent
Thio halides                  Csp(=O)Hal{Csp:C|S|P}                              Stability, alkylating agent
Carbamates                    NoroC(=O)Hal{Noro:N|O|S}                           Stability, alkylating agent
Chloroformates                NoroC(=O)Hal{Noro:N|O|S}                           Stability, alkylating agent
Isocyanates                   N=C=Het                                            Stability, alkylating agent
Thioisocyanates               N=C=Het                                            Stability, alkylating agent
Diimides                      N=C=Het                                            Stability, alkylating agent
Sulfonating agents            Het(=O)(=O)Lvg{Lvg:OHev|Hal}                       Stability, alkylating agent
Phosphorylating agents        Het(=O)(=O)Lvg{Lvg:OHev|Hal}                       Stability, alkylating agent
Epoxides, etc.                C[1]HetC@1                                         Stability, alkylating agent
Diazos                        Any-N[F]-N[F]                                      Stability, toxicity
Azides                        Any-N[F]-N[F]-Oorn[F]{Oorn:O|N}                    Stability, toxicity
Nitroso                       Any-N[F]-N[F]-Oorn[F]{Oorn:O|N}                    Toxicity
Mustards                      HaloC(Any)C(Any)Lvg{Lvg:Het|Halo}{Halo:Br|Cl|I}    Stability, alkylating agent
2-haloethers                  HaloC(Any)C(Any)Lvg{Lvg:Het|Halo}{Halo:Br|Cl|I}    Stability, alkylating agent
Quaternary Nitrogens          Hev-Norp(-Hev)(-Hev)-Hev{Norp:P|N}                 Extraction difficulties
Quaternary Phosphorus         Hev-Norp(-Hev)(-Hev)-Hev{Norp:P|N}                 Extraction difficulties
Acid anhydrides               Het=Any-[!r]O-[!r]Any=Het                          Stability, alkylating agent
Aldehyde                      CCH=O                                              Stability, alkylating agent
Polyfluorinates               FC(F)C(F)F                                         Unfashionable
Michael acceptor              O=C(Nothet)-C=Any(H)Nothet{Nothet:C|H}             Toxicity
[?]                           [?]                                                Stability
Other Triaryls                Any:Any-[!r]Any(-[!r]Any:Any)\                     Stability
                              (-[!r]Any:Any)Lvg{Lvg:Het|Hal}
Alpha-dicarbonyls             Oorn=[!r]Any(AnyHev)-C=[!r]Oorn{Oorn:O|N}          Stability



The choice of whether to eliminate some reactants based on such general and specific
considerations will vary with the given situation. Except in the case of toxic materials, it is
recognized that any other limiting selection decreases the diversity of the combinatorial library
and potentially eliminates active molecules. As always, when eliminating reactants at the very
beginning of library design, the problem boils down to a question of probabilities: what is the
likelihood of missing a significant lead molecule? In the real world, what is desired at the very
least is a high probability that such a molecule will not be missed if the selection
criteria under consideration are implemented. The application of many of these selection
criteria (price, availability, toxicity, bioavailability, diffusion, and non-biologically relevant
structural groups) can occur before, during, or after the screening library has been selected
based on other criteria. Clearly, however, the earlier these selection criteria are applied, the
greater will be the reduction in the number of combinatorial possibilities which will need to
be evaluated later in the design process. As will be discussed below, not only are these criteria
applied at the reactant level, but some of them may also be applied again at the product level.
Reduction of the number of reactants (for the reasons set forth above) in the early stages of
the library design process is indicated in Figure 11 at matrix C.
B. Removal of Non-Diverse Reactants

As noted earlier, an ideal combinatorial screening library will: 1) have molecules
representing the entire range of diversity present in the chemical universe accessible with a
given set of combinatorial materials; and 2) not have two examples of the same diversity
when one will suffice. The goal is to obtain as complete a sampling of the diversity of
chemical space as is possible with the fewest number of molecules, and, coincidentally, at
lowest cost. In selecting a subset of a possible combinatorial universe to include in a screening
library, there are two opportunities based on diversity considerations to reduce the number of
included molecules. The first opportunity occurs when selecting reactants for the combinatorial
synthesis. The fewer the number of reactants, the far fewer the number of combinatorial
possibilities. The second opportunity occurs after all the combinatorial possibilities from the
chosen reactants (and core) have been selected. The method of the present invention utilizes
both opportunities by using validated metrics appropriate to each situation.

Any metric which has been shown by the Patterson plot validation methodology to be
valid/useful when applied to reactants may be used at this stage of the library design process.
However, there are a number of reasons to use a metric which reflects the steric diversity of
the combinatorially accessible chemical universe. The principal reason is that the accumulated
observation of biological systems is that ligand-substrate binding is primarily governed by three
dimensional considerations. Before a reactive side group can get to the active site, before
appropriate electrostatic interactions can occur, before appropriate hydrogen bonds can be
formed, and before hydrophobic effects can come into play, the ligand molecule must basically
"fit" into the three dimensional site of the substrate. Thus a principal consideration in
designing screening libraries should be to sample as much of the three dimensional (steric)
diversity of the combinatorial universe as is possible. The initial method of the present
invention does this by utilizing the validated topomeric CoMFA metric to analyze the steric
properties of the proposed reactants.

A second reason for applying a steric metric to the reactants is that all of the three
dimensional variability of the products resulting from a combinatorial synthesis resides in the
substituents added by the reactants, since the core three dimensional structure is common to all
molecules in any particular combinatorial synthesis. In a sense it would be redundant to
measure the contribution to each product molecule of a core which is common to all the
products. A third reason for applying a three dimensional metric to the reactants is that a
sterically sensitive metric distinguishes differences among molecules that are not revealed using
other presently known metrics. For instance, the topomeric CoMFA metric is more sensitive
to the volume and shape of the space occupied by a molecule than is either the
side chain or whole molecule Tanimoto descriptor. Figure 12 provides an illustrative example
of this feature drawn from the thiol study which confirms what was seen in the Patterson plots
of the topomeric CoMFA and Tanimoto whole molecule descriptor. Figure 12 shows three
clusters labeled 24, 25, and 29 for which the Tanimoto whole molecule fingerprint metric does
not indicate any substantial difference in molecular structure among the molecules, labeled (a)
through (f), making up each of the clusters. The large panel A in the upper right of Figure 12
shows orthogonal 3D views of the volume differences within clusters 24, 25, and 29
comparing each of the molecules that are not in the majority steric field cluster. For example,
the Cluster 24 figure B at the top shows four contours (yellow, green [hidden], red, and blue)
indicating the differences in volumes occupied by compounds 24(a), 24(b), 24(c) and 24(f)
compared to compounds 24(d) and 24(e) which are found in the same steric field cluster,
number 10. The middle C and bottom D figures in the large panel A show similar
distinguishable volume differences for Clusters 25 and 29. While the whole molecule Tanimoto
metric does not distinguish much difference between the molecules within each of these
clusters, it is readily apparent from Figure 12, even to an untrained eye, that the molecules
in the clusters represent very different types of structural diversity; that is, significantly
different three dimensional volumes are occupied by the molecules within each whole molecule
Tanimoto determined cluster. The topomeric CoMFA metric clearly shows steric differences
that are not indicated by the 2D Tanimoto. As seen earlier, a side chain Tanimoto similarity
descriptor also does not distinguish steric differences amongst some molecules. A metric
responsive to steric differences is, therefore, clearly preferred as a diversity discriminator for
reactants.

The initial method for selecting reactants based on diversity is shown schematically at
the third filter in Figure 11. A diversity selection based on three dimensional steric measures
begins by: 1) generating 3D structures for the reactants; 2) aligning the 3D molecular
structures according to the topomeric alignment rules; 3) generating CoMFA steric field values
for the reactants including, if desired, hydrogen bonding fields, and applying a rotatable bond
attenuation factor; and 4) calculating pairwise topomeric CoMFA differences for every pair
of reactants. At this point the steric diversity of the reactant space has been mapped into the
topomeric CoMFA metric space. From the validation of the topomeric CoMFA metric, it was
found that the neighborhood radius for an apparent activity difference of 2 log units was
defined by a distance of approximately 80 - 100 topomeric CoMFA units (kcal/mole).
Therefore, at this point, the method of the invention clusters (using hierarchical clustering) the
reactants in topomeric CoMFA space so that reactants having a pairwise difference of less than
approximately 80 - 100 units are assigned to the same cluster. Put another way, clustering is
continued until the inter-cluster separation is greater than approximately 80 - 100 units. (If
desired, there is some leeway in choosing the exact neighborhood radius in and about the
neighborhood range to use for any given biological system. An experienced practitioner of the
clustering art will easily be able to determine, by noting the natural breaks in the clustering,
where about the 80 - 100 range the best clustering is obtained.) This process will produce clusters
having reactants whose product activities will only rarely differ by more than approximately
2 log units. If reactant clusters having product activities differing by a greater or lesser
amount are desired, the neighborhood distance used may be increased or decreased
accordingly. The effect on the neighborhood distance of choosing such other activity range can
be seen by viewing the Patterson validating plots for the topomeric CoMFA descriptor.
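
The clustering step described above can be sketched as follows. This is only an illustration under stated assumptions: the text specifies hierarchical clustering and a cutoff of approximately 80 - 100 units, but not the linkage rule, so complete linkage is assumed here; the default `radius` of 90 is simply the midpoint of the stated range; and the distance matrix `toy_dist` is invented data, not topomeric CoMFA output.

```python
def cluster_reactants(dist, radius=90.0):
    """Agglomerative clustering on a symmetric pairwise distance matrix.

    Assumption: complete linkage (cluster separation = largest pairwise
    distance between members). Merging stops as soon as the two closest
    clusters are separated by more than `radius`.
    """
    clusters = [[i] for i in range(len(dist))]

    def separation(a, b):
        return max(dist[i][j] for i in a for j in b)

    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = separation(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        if d > radius:
            break  # every remaining pair of clusters is farther apart than the radius
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]


# Invented pairwise differences for four hypothetical reactants.
toy_dist = [
    [0, 30, 200, 210],
    [30, 0, 190, 205],
    [200, 190, 0, 50],
    [210, 205, 50, 0],
]
toy_clusters = cluster_reactants(toy_dist)
```

Raising or lowering `radius` corresponds to choosing a larger or smaller activity range, as discussed in the preceding paragraph.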

The clustering process now identifies groups (clusters) of reactants having steric
diversity from one another but also having the same steric properties within each cluster. Or,
put in terms familiar to medicinal chemists, the molecules of each cluster should be bioisosteres.
For purposes of designing a combinatorial screening library which has within it molecules
representing the full range of steric diversity present in the universe of reactants, it is now only
necessary to select one reactant from each cluster for inclusion in the library. A reasonable
way to select the one reactant from each cluster would be to select the lowest priced or most
readily available one. However, additional criteria may be considered. The diverse reactants
remaining at matrix D need not be adjacent to each other on the combinatorial matrix and are
only shown this way for graphic convenience. At this point the first stage of library design has
been completed.
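
As a small illustration of choosing one reactant per cluster by lowest price, the following sketch may help. The reactant names and prices are invented, and breaking price ties by name is an assumption made only so the result is deterministic.

```python
def pick_representatives(clusters, price):
    """Return one reactant per cluster: the cheapest, ties broken by name."""
    return [min(cluster, key=lambda r: (price[r], r)) for cluster in clusters]


# Hypothetical clusters and catalogue prices.
example_clusters = [["thiol_a", "thiol_b"], ["thiol_c"]]
example_price = {"thiol_a": 5.0, "thiol_b": 2.5, "thiol_c": 9.0}
representatives = pick_representatives(example_clusters, example_price)
```

Other criteria, such as availability, could be folded into the sort key in the same way.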

While the use of a topomeric CoMFA metric to measure the three dimensional
structural diversity of the reactants has been discussed, it should be apparent that any metric:
1) reflective of the three dimensional properties of molecules; and 2) validated as taught above,
could be applied to the reactants to be used in a combinatorial synthesis in the manner taught
above. The teaching of this invention is not limited to the use of the topomeric CoMFA metric,
but also includes the use on reactants of all validated three dimensional metrics. As seen
earlier, at the present time initial studies of topomeric hydrogen bonding fields indicate that
it should be a very useful metric. For those reactants expected to form a large number of
hydrogen bonds, this may be the metric of choice. The hydrogen bonding metric would be
used as an adjunct to the topomeric CoMFA metric in those situations. There may be situations
where a sterically sensitive metric is not needed, in which case it should be clear that any valid
metric appropriate to reactants could be used.

C. Identification (Building) Of Products

Once the set of diverse reactants has been identified by the above method, the structures
of the product molecules can be combinatorially determined based on the synthetic reaction
scheme and any desired cores. The structures of the combinatorial products are built from
the reactants using LEGION and are stored in molecular spread sheets. In matrix E
the products which can still be built from the available reactants are shown as asterisks in each
matrix location.

D. Removal Of Products For Non-Diversity Reasons

After the possible product structures have been identified, another opportunity exists 
to reduce the number of products due to general non-diversity considerations. These 
considerations will generally be related to the particular chemistry involved and might relate
to product instabilities, cyclic structures, etc. (Matrix F)

During the building of the combinatorial product molecules, the size of the product
molecules increases and various combinations of core and substituents will affect the likely
diffusion of the molecule (and may even form one of the biologically undesirable molecular
groupings). Thus, in order to eliminate molecules which would not be used as drugs, the
product molecules should be examined with many of the same selection criteria applied to
reactants. In particular, molecular weights should be calculated and those compounds which
have molecular weights over a predetermined value should be rejected. Typically, a value of
750 is used at this time as a representative weight above which bioavailability may become a
problem. In addition, CLOGP should be calculated and any proposed molecule with a value
under -2.5 or over 7.5 rejected. The number of structures eliminated at this point will depend
in part both on the chemistry involved and the molecular weight range retained at the reactant
stage. These additional product structures which are eliminated are reflected in matrix G.
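
The two property cutoffs just described can be expressed as a simple predicate. This is a minimal sketch that assumes the molecular weight and CLOGP values have already been computed elsewhere; the function name and the example values are hypothetical.

```python
def passes_product_filter(mol_weight, clogp,
                          max_weight=750.0, clogp_min=-2.5, clogp_max=7.5):
    """Keep a product unless its weight exceeds max_weight or its CLOGP
    falls under clogp_min or over clogp_max (limits taken from the text)."""
    return mol_weight <= max_weight and clogp_min <= clogp <= clogp_max


# Invented (molecular weight, CLOGP) pairs for three hypothetical products.
candidates = {"p1": (412.5, 3.1), "p2": (803.0, 4.0), "p3": (390.2, 8.2)}
retained = [name for name, (mw, logp) in candidates.items()
            if passes_product_filter(mw, logp)]
```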
E. Removal Of Non-Diverse Products

As noted, a second opportunity based on diversity considerations to reduce the number
of molecules to be included in the combinatorial screening library occurs after the products of
a proposed combinatorial synthesis have been "built" by the software in the computer. Such
an additional reduction is usually necessary since the number of combinatorial products at this
stage may still be astronomically large. This is reflected in matrix G. In addition, it makes no
sense to screen any more molecules than is absolutely necessary, and redundancy may occur
in the products for several reasons. In a simple case, if two diverse reactants may react
independently at each of two possible sites on a symmetric core molecule, two identical
product molecules will be generated. In a more complex case, it is possible that one
combination of core and reactants is similar (due to the similarities of structures contained in
the core to the structure of the reactants) to another combination of core and reactants. That
is, when the reactants are combined with the core molecule, it is possible that substructures
within the core can combine with different substituents to form similar structures. Clearly, it
would be redundant to screen both. How to select product molecules has been a vexing
problem in the prior art, and this is one reason why the prior art has basically been concerned
with clustering criteria. The general approach taken in the prior art to avoid oversampling
combinatorial product molecules representing the same diversity has been to cluster the
molecules and then maximize the distance between clusters with whatever metric was applied
to the products.

Based upon an understanding developed from the theoretical considerations of validating
a metric outlined above, the library design method of this invention again makes use of the
neighborhood principle to solve this problem. However, it is important to understand that,
unlike some methods of the prior art, the method of this invention specifically does not use a
metric to cluster product molecules. Rather, the neighborhood definition may be used to decide
which product molecules to retain in the final screening library and, correspondingly, when
the appropriate number of product molecules have been selected for inclusion in the library.
Essentially, starting with one product molecule, additional molecules are selected as far apart
as possible (in the validated metric space) from any molecule already in the library until the
next molecule to be selected would fall within the neighborhood distance of a molecule already
included. Additional molecules are not included because to do so would include two or more
molecules within the library representing the same structural diversity. Therefore, the
neighborhood principle is used as a sampling rule to ensure that molecules representative of
the same diversity or otherwise too similar are not included in the library. The resulting
combinatorial screening library is not redundant and has not oversampled the diversity space.

In the present invention, the Tanimoto 2D whole molecule similarity coefficient is used
for the final product selection. As was seen above, this metric possesses the neighborhood
property. Accordingly, from the combinatorial products either a first product is arbitrarily
chosen for inclusion in the library or an initial seed of one or more products may be specified.
(If an arbitrary product molecule is chosen, Tanimoto coefficients are calculated for all other
molecules to the first molecule and a second molecule with the smallest Tanimoto coefficient
[greatest distance = least similarity] from the first is chosen for inclusion.) For the efficient
selection of additional molecules to be included, the distance (1 - Tan. Coeff.) between each
additional molecule and all molecules already included in the library is calculated. For each
additional molecule, the distance to the closest molecule already in the library is identified.
These closest distances for each additional molecule are compared, and the additional molecule
whose closest distance is the greatest is selected next for inclusion; that is, the molecule which
is farthest away from the closest molecule in the library is selected. A new set of distances is
calculated and the process continued, selecting one molecule at a time, until no more molecules
remain which are farther away than 0.15 ([1 - 0.85] the definition of a Tanimoto "distance"
using the neighborhood value of 0.85). While this example is presented in terms of the
Tanimoto similarity coefficient, any validated whole molecule metric and its neighborhood
definition may be used with this sampling procedure.
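
The sampling procedure just described can be sketched in a few lines. The sketch makes several assumptions that are not in the text: each molecule is represented by a fingerprint given as a Python set of "on" bit indices, the seed is passed as an index, ties are broken toward the lower index, and the four example fingerprints are invented.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def select_diverse(fps, seed=0, radius=0.15):
    """Repeatedly add the molecule whose closest library member is farthest
    away, stopping when no remaining molecule is farther than `radius`
    (1 - 0.85 for the Tanimoto neighborhood) from the library."""
    selected = [seed]
    # Distance from each remaining molecule to its closest library member.
    closest = {i: 1.0 - tanimoto(fps[i], fps[seed])
               for i in range(len(fps)) if i != seed}
    while closest:
        nxt = max(closest, key=lambda i: (closest[i], -i))
        if closest[nxt] <= radius:
            break  # everything left lies inside some neighborhood
        selected.append(nxt)
        del closest[nxt]
        for i in closest:
            closest[i] = min(closest[i], 1.0 - tanimoto(fps[i], fps[nxt]))
    return selected


# Invented fingerprints: molecule 1 is a near duplicate of molecule 0.
example_fps = [
    set(range(10)),
    set(range(11)),
    set(range(100, 110)),
    set(range(5, 15)),
]
library = select_diverse(example_fps)
```

Keeping only the running closest distance for each candidate avoids recomputing every pairwise distance after each selection.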

As noted earlier, the value of 0.85 for the Tanimoto neighborhood definition originally
appeared in the sigmoid plots. To confirm whether this is the correct neighborhood definition
for the Tanimoto metric, the Patterson plots for the whole molecule Tanimoto in which the χ²
indicated significance were used to calculate the neighborhood value. The metric distances
corresponding to 2-log and 3-log biological differences were determined by dividing the slope
of the density determined line by the values 2 and 3 respectively. Over the data sets, the
average metric distance for a 2-log biological difference was 0.14 and the average metric
distance for a 3-log biological difference was 0.21. Since the Tanimoto distance of (1 - Tan.
Coeff.) is plotted in the Patterson plot, these values correspond to a 2-log similarity of 0.86
and a 3-log similarity of 0.79. This confirms the reasonableness of using 0.85 in the sampling
process. Also, as discussed earlier, it is reasonable to have more confidence in the definition
of the neighborhood derived from the Patterson plots which utilize all the molecular data. As
noted with reference to selection of a neighborhood distance using the topomeric CoMFA
metric on reactants, there may be a situation where a different biological activity may be 
appropriate and a correspondingly different neighborhood distance used for product selection. 

Conceptually this selection process is reflected in Figure 13. Figure 13 shows a plot
of the Tanimoto 2D pairwise similarities for a typical combinatorial product universe in which
there has been some selection of reactants based on diversity. As can be seen, a very large
percentage of the products have similar structures (Tanimoto coefficients > 0.85). The
sampling process outlined above results in the following. Molecules having pairwise
similarities above approximately 0.85 have overlapping neighborhood radii as shown at 1 and
one of each pair is excluded from the library. Molecules having pairwise similarities of
approximately 0.85 have almost touching but not overlaying neighborhood radii as shown at
2 and are included in the library. Molecules having pairwise similarities significantly less than
approximately 0.85 have no overlapping neighborhood radii as shown at 3 and are also
included in the library. Excluding molecules with a Tanimoto similarity greater than 0.85 will
eliminate a significant number of molecules in this representative product assembly. This
reduction is also reflected in matrix H. While the circles of similarity shown in Figure 13
represent convenient conceptualizations of the neighborhood distance concept, it should be
remembered that most metrics will not define a space in which the "distance" corresponds to
an area or volume. In particular, a Tanimoto similarity space does not have this property, yet
the "similarity" to a neighbor can be defined and is very useful.

A specific example illustrates the dramatic power of the final selection stage in the
design process. A proposed combinatorial screening library was designed using thiols and
sulfonyl chlorides as reactants. (Many of the same thiols were considered in the study
discussed earlier.) The original 716 thiols and 223 sulfonyl chlorides considered would make
159,668 potential products. Topomeric CoMFA analysis indicated that 170 thiols and 61
sulfonyl chloride reactants represented diverse molecules for the purposes of this design and
should be used in further library design. 10,370 combinatorial products were now possible.
Graph 1 of Figure 14 shows the Tanimoto similarity distribution of the 10,370 possible
products. It can be seen that a large percentage of the possible products were at least 0.85
similar to each other. Following the final stage selection process of the method of this
invention, 1,656 product molecules were selected, none of which was 0.85 similar to any other.
Graph 2 of Figure 14 shows the plot of the Tanimoto similarities of the final library design
products. (The Y axis of the graph is plotted in fraction per % so that the integrated totals are
proportional to 10,370 and 1,656 respectively.) The remarkable selectivity of the sampling
process is immediately apparent. The products of the designed library have a clearly different
similarity profile than the non-selected products. In addition, there has been a greater than 6:1
reduction in the number of product compounds. Thus, from a possible universe of 159,668
potential combinatorial products, 1,656 have been identified which represent the structural
diversity of the large ensemble. An approximate 100:1 reduction has been achieved without
sacrificing the diversity of the combinatorially accessible universe. As a result of the library
design, only the 1,656 compounds have to be synthesized. In addition, these same 1,656
compounds can be tested in any number of biological assays with a high degree of assurance
that even in assays with unknown biological activity requirements, these compounds will
present the diversity of compounds accessible through this combinatorial universe to the
biological assays. Thus there is not only a savings in time and expense in the synthesis and
testing of the identified molecules in the library, but it is not necessary to change library
design (with concomitant time and expense) each time it is desired to screen a different
biological assay. Over time, using the library design of this invention and the process for
merging libraries discussed below, it will be possible to build up an optimally diverse
combinatorial screening library based on many different combinatorially accessible universes,
and this combined library will represent the first real general purpose screening library
available to the art - a realization of a long sought after, and previously believed unattainable,
goal.
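
The counts quoted in this example can be checked with a few lines of arithmetic. The variable names are illustrative only; every number comes from the paragraph above.

```python
thiols, sulfonyl_chlorides = 716, 223
all_products = thiols * sulfonyl_chlorides      # full combinatorial universe

diverse_thiols, diverse_sulfonyl_chlorides = 170, 61
diverse_products = diverse_thiols * diverse_sulfonyl_chlorides  # after reactant selection

final_library = 1656                            # after the product sampling stage

product_stage_reduction = diverse_products / final_library  # the "greater than 6:1"
overall_reduction = all_products / final_library            # the "approximate 100:1"
```

The product-stage ratio works out to roughly 6.3:1 and the overall ratio to roughly 96:1, consistent with the figures stated in the text.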

Clearly, other validated whole molecule metrics and their associated neighborhood
distances can be used with the sampling process described above to select product molecules
for inclusion in a screening library. However, it makes no sense to use the same metric for
the products as was used for the reactants. For instance, in the case of the topomeric CoMFA
metric, no information would be gained if the metric was used again with the products since
all the steric information from the reactants has been transferred to the products. What is
critical is that the combinatorial screening library should be constructed by including product
molecules which do not fall within the neighborhood radius of other molecules and excluding
product molecules which fall within the neighborhood radius of previously chosen molecules.
At the end of the design process of this invention, a list of product structures and the reactant
sources for each is available in the computer and can be output either in electronically readable
or visually discernible form. This data defines the combinatorial screening library. The list
of reactants is supplied to synthetic organic chemists. Actual synthesized molecules are then
available for testing in the biological assays, typically on multiple well plates. The list of
products from each library design can be used to create a definition of a larger combinatorial
screening library when merged with other such libraries as discussed below.

The combinatorial screening library designed by the method of this invention is both 
locally diverse (no two reactants representing the same steric space are present) and globally
diverse (no two products having overall similar structures are present). Such a library thus
meets the desired combinatorial screening library criteria of being representative of the
diversity of the entire combinatorially accessible chemistry universe while at the same time not
containing more than one sample of each diversity present (no oversampling). An optimally
diverse combinatorial screening library has thus been achieved. By designing an optimally
diverse screening library, a reduction in the number of combinatorially generated structures
which need to be synthesized and tested of substantially greater than 10^ - Itf should be 
possible. 

9. Lead Compound Optimization

Unless an entire combinatorially accessible chemical universe is screened, a lead
molecule found from screening a library will rarely be the most active or the optimal molecule
desired. Therefore, extensive additional work is usually required searching for a related
compound possessing the greatest activity or some combination of activity and another
desirable feature such as bioavailability. Most of the time, the design of the screening library
from which the compound was identified provides little, if any, help in this search. Again,
medicinal chemists must resort to traditional methods of lead development. Combinatorial
screening libraries based on the methods of this invention provide the means for a directed
search of the chemistry space in a way not possible with prior art libraries.

This feature results directly from the fact that the libraries are constructed at each level
by selecting molecules which are representative samples of particular molecular diversities.
Thus, once a lead is identified, it is a straightforward matter to identify and test compounds
representative of the same and/or closely related diversity; i.e., it is known how to identify
molecules within the neighborhood of the active lead, as defined by the validated metrics used
to construct the screening library. Furthermore, the synthetic chemical methods used to
construct the screening library are already known and tested and can be used to synthesize
additional molecules of the same or similar molecular structural diversity. Since time is always
of the essence, especially in exploring a newly discovered biological target, a rational follow
up search through an optimally designed library of this invention permits homing in on crucial
molecular structures directly and quickly. Not only does this procedure speed up the
development process, but it also avoids wasting the time and effort synthesizing and analyzing
large numbers of compounds not in the neighborhood of the lead compound which would be
erroneously tried prior to knowledge of this invention.

Because the libraries of this invention have been constructed using two selection steps 
based on molecular structural differences, each step provides an opportunity to identify and
explore compounds having similar structural features. 

A. Advantages Resulting From Product Filter 

Due to the way the final product molecules were selected for inclusion in the library,
all compounds with a Tanimoto similarity of approximately 0.85 or greater to a compound
already in the library were excluded. Therefore, the first place to look for compounds likely
to have the same activity as the lead compound is in the group of all compounds in the
combinatorial universe from which the lead was identified having a Tanimoto coefficient with
respect to the lead compound of approximately 0.85 or greater. Then, since each of these
initial compounds will also have an associated group of different compounds within
approximately 0.85 Tanimoto similarity of themselves, this larger group forms the second layer
of what can be an expanding area of similar compounds to investigate. How far outwards from
the lead compound the search is carried (each time searching within a Tanimoto coefficient of
approximately 0.85) will be determined by the success of these additional compounds showing
activity in the same assay as the lead compound. Thus, the library design itself identifies and
permits a directed search for compounds from the utilized combinatorial universe most likely
to have activity similar to the lead compound. The same procedure is followed if another valid
metric (not the Tanimoto similarity) was used to create the library. Then all compounds within
the neighborhood distance to a compound already in the library were excluded and the first
place to look would be for compounds which fall within the neighborhood distance. The
process is exactly identical to that followed using the Tanimoto descriptor.
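
The layer-by-layer search outward from a lead can be sketched generically. The sketch assumes a caller-supplied `distance` function and neighborhood `radius` for whatever validated metric was used; the integer "molecules" and absolute-difference distance in the example are stand-ins chosen only to make the behavior easy to follow.

```python
def neighbor_layers(lead, universe, distance, radius, max_layers=2):
    """Return successive layers of molecules: layer 1 lies within `radius`
    of the lead, layer 2 within `radius` of some layer-1 molecule, and so on."""
    seen = {lead}
    frontier = [lead]
    layers = []
    for _ in range(max_layers):
        layer = [m for m in universe
                 if m not in seen
                 and any(distance(m, f) <= radius for f in frontier)]
        if not layer:
            break
        layers.append(layer)
        seen.update(layer)
        frontier = layer
    return layers


# Stand-in data: positions on a line instead of real molecules.
toy_universe = [0, 10, 20, 40, 90]
toy_layers = neighbor_layers(0, toy_universe, lambda a, b: abs(a - b), radius=15)
```

In practice, how many layers to examine would be guided by whether the compounds in each layer show activity, as the text notes.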

B. Advantages Resulting From Reactant Filter

Two consequences flow from the selection of only one reactant from each cluster. First,
combinatorial products containing that reactant may or may not be the most active with respect
to any particular given biological screening test. There is no way to guarantee that the reactant
that yields the most active product will be selected from the cluster. For any reasonably sized
cluster, the probabilities of finding the reactant that yields the most active product would not
be greatly increased even if two reactants from that cluster were chosen, and the size of the
library to be tested would have been doubled.

However, the second consequence of selecting only one reactant from each cluster
presents the flip side of the selection coin. Once a lead compound is identified, the library
design immediately indicates from which diverse clusters the reactant molecules were chosen.
All the other possible reactants (in the combinatorial chemical universe under study)
representing similar aspects of diversity are included in the clusters from which the reactants
were chosen. For lead optimization, compounds containing the other reactants from the
identified cluster(s) can be synthesized and tested. The library design itself assures that the
exploration of these reactants is likely to yield compounds with similar activity to the lead
compound. Thus the reactant selection process not only reduces the number of molecules that
need to be screened, but simultaneously identifies the molecular structures which should be
subsequently explored to find the compound with the highest activity similar to the identified
lead. No other prior art library design process provides so much information for lead
optimization.

C. Additional Optimization Methods Using Validated Metrics

The knowledge that a metric is valid, and what that implies for the metric space as
discussed earlier, immediately enables methods for lead optimization not previously possible.
In particular, knowing that a metric will define a design space where compounds with similar
biological properties are found measurably near each other (the definition of a valid metric),
now permits for the first time the quantitative examination of the array of molecules used in
any screening assay to determine whether any molecules are measurably close to the identified
lead compound. One aspect of this approach has already been discussed in sections 9.A and
9.B and certainly works best with an optimal library designed by the method of this invention.
In addition, however, validated metrics will permit useful examination of any assemblage of
compounds whether or not the lead compound is identified from within the assemblage. There
is no restriction on the source of the additional compounds to be examined and they may range
from prior art screening libraries to chemical databases. Once a lead is identified, a validated
metric would be used to map the lead and all other compounds in the assemblage to be
examined into the metric space; i.e., the metric characteristics/values are determined for all
possible compounds. For reactants (possible substituents) a metric validated on reactants would
be used; for whole molecules, a metric validated on whole molecules would be used. Metric
differences between the lead molecule and all the other molecules would then be calculated.
All molecules with metric distances to the lead within the neighborhood distance of the
validated metric should have similar biological activities. Again, if the metric distances from
each molecule thus identified as falling within the neighborhood distance of the lead are then
calculated with respect to all other molecules (excluding the lead and each other), a second
layer of molecules is identified which should have activity similar to the active neighbors of
the lead molecule. Additional layers may be similarly identified and explored experimentally.
Depending on the structures involved, at least two layers would normally be explored. Thus,
because validated metrics are now available, lead optimization will much less often be the hit
or miss procedure characteristic of the prior art.

An extension of this procedure yields yet another major advance. In the prior art it was
not possible to tell how far away from the lead (in structural terms) one should explore in the
search for a compound more active than the lead. In terms of the two dimensional activity
island analogy of Figure 1, no procedure existed for exploring the shape or extent of the island
of activity. Without knowledge of the island's shape and extent, not only was it impossible to
know by how far a compound missed the island, but even when an active compound was
found, it was also not possible to know if the island had been sufficiently explored; that is,
whether all compounds representing the range of diversity spanned by the activity island had
been identified. In other words, had every place been explored that should have been?

With the molecules identified by the expansion procedure outlined above, it will now 
be possible to map the island. Starting with molecules within the neighborhood distance of the 
lead, molecules would be synthesized and tested for activity. If all the molecules within the
neighborhood distance ("nearest neighbors") show activity, each still falls within the boundary 
of the island, and the next layer of molecules in the neighborhood distance expansion would 
be synthesized and tested. If only some of the nearest neighbor molecules show activity, the 
neighborhood radius of the lead must span an edge of the activity island, and only molecules 
falling within the neighborhood distance of these nearest neighbor active molecules would be 
included in the next layer of the expansion and synthesized and tested. Again, some of the 
newly tested molecules may show activity and some may not. This process of nearest neighbor 
molecule identification and testing should be repeated until no molecule in the next expansion
layer shows any activity. The active molecules determined by this procedure will define the 
limits and shape of the activity island in terms of structural differences. 
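
The island-mapping loop can be sketched as a variation on the layered search in which only active molecules seed the next layer. Here `is_active` is a hypothetical callback standing in for synthesis and assay of a molecule, and the integer "molecules" and activity rule in the example are invented for illustration.

```python
def map_activity_island(lead, universe, distance, radius, is_active):
    """Expand outward from the lead, testing each new layer and continuing
    only from the molecules that prove active. Returns the active molecules
    found, which outline the island within the limits of `universe`."""
    active = [lead]
    tested = {lead}
    frontier = [lead]
    while frontier:
        layer = [m for m in universe
                 if m not in tested
                 and any(distance(m, f) <= radius for f in frontier)]
        tested.update(layer)
        frontier = [m for m in layer if is_active(m)]  # inactive ones end that branch
        active.extend(frontier)
    return active


# Stand-in data: the "island" is every position up to and including 30.
island_universe = [0, 10, 20, 30, 40, 50]
island = map_activity_island(0, island_universe,
                             lambda a, b: abs(a - b), radius=10,
                             is_active=lambda m: m <= 30)
```

In this toy run the molecule at 40 is tested and found inactive, so the molecule at 50 is never tested, mirroring the stopping rule in the text.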

The resolution obtainable with this procedure depends upon how well the structural
diversity of the activity island is represented by the molecules in the original assemblage. That
is, if only a portion of the activity island structural diversity is represented in the assemblage
of molecules, that is the only part of the island which can be explored. Alternatively, perhaps
only the island's rough outline can be perceived. Within the constraints of the diversity present
in the assemblage, exploration of the full extent of the island and of the space within its
boundaries can be accomplished with the guidance of the validated metric with which the
island is mapped. To explore the island further it is only necessary to identify molecular
structures not included within the original assemblage with which to test the unknown territory.
In some cases in order to distinguish particular structural differences, it may be necessary to
consider additional sources of structurally diverse molecules and, perhaps, to map the lead and
additional compounds in more than one metric space. Thus, possible structures can be
proposed and examined with the validated metric. If the proposed structures fall within the
neighborhood distance of an active molecule, they can be experimentally tested. If those are
active, further structures can be proposed and again examined to determine whether they fall
within the neighborhood distance of the newly identified active molecule. If they do, they
would be experimentally tested. Repeating this cycle of identification and testing will ultimately
yield a higher resolution map of the island and assure the searcher that the island has been
thoroughly explored and no activity peak has been missed.

The availability of validated metrics enables yet another method of rationally directed
lead optimization from a knowledge of the structure of a lead molecule which was not
identified from screening an optimally diverse combinatorial screening library. Essentially, the
reactant screening process is utilized backwards to identify similar molecular structures, and
then the product screening process is utilized to confirm structural similarity of proposed
products to the lead. Two cases are important. The first involves lead molecules which can be
synthesized directly from reactants. In this method, the lead molecule would be analyzed to
determine from what constituent reactants it may be synthesized. These reactants would then
be characterized using a reactant metric such as topomeric CoMFA. Molecules in databases
of potential reactants would be characterized using the reactant metric and searched for
reactants falling within the neighborhood radius of each of the original reactants. The identified
reactants will provide a basis for building proposed products having the same structural
characteristics (diversity) as the original lead compound. However, before the product is
synthesized, its similarity in metric space to the lead would be checked using a product
appropriate metric to make sure that it falls within the neighborhood radius of the lead.

The second case involves lead compounds in which substituent groups are bonded to
a central or core molecule. The reactants which form the basis of the substituents as well as
the core molecule would then be characterized using appropriate validated metrics. Again,
molecules in databases of possible reactants and core molecules would be characterized with
validated metrics and searched for molecules falling within the neighborhood radius of each
of the original reactants and core. The molecules thus identified would provide a basis for
building proposed products with structural diversity similar to the lead compound. Again,
before synthesis, the proposed products would be evaluated with an appropriate metric to
confirm that they fall within the neighborhood distance of the lead compound.

Since it is known that molecules resulting from different chemistries and involving
different constituents often show activity in the same biological assay, it would be desirable
to search as wide a range of molecules as possible when performing the searches outlined
above to identify additional molecules that are within the neighborhood distance of some lead
compound. Clearly, when contemplating these procedures, it must be recognized that the
universe of all accessible chemical substances, even under the constraints of molecular weight
that characterize a useful drug, numbers trillions of structures. While such unprecedented
directed searches are only now possible with validated metrics, until the discovery and creation
of the virtual library discussed later, even with today's powerful computers, the practicality
of such large searches depended on preorganizing the trillions of candidate structures in such
a way that the vast majority of candidates could be excluded, to the greatest extent possible,
at the start of the search.

For instance, one such useful preorganization involves dividing the candidates into
series of molecules accessible by some common synthetic route, and thus describable in terms
of a core and reactants. (Typically, the synthetic route used to create the lead would be the
first investigated and other sets of alternative routes explored secondarily.) A combinatorial
SYBYL Line Notation (cSLN) affords a useful description of such a series of molecules.

Molecules represented by a cSLN would be considered for overall similarity to an
active lead molecule in the manner discussed above. Using validated metrics, it is most
efficient to: 1) first identify each of the individual lists of reactants within the cSLN with the
most similar side chain within the active lead; 2) next, to consider the similarity of the "core"
within the lead (the atoms remaining after the side chains are identified) to the non-variant core
within the cSLN; and 3) then, if the "core" similarity is not so low that this series of
molecules can immediately be excluded, to order the variation lists by similarity to the
corresponding side chains within the lead. The advantage of such a partitioning and
preordering by similarity is the ability to break off the search as soon as no remaining member
of the series would be likely to be sufficiently similar.

As an overly simplistic example, consider the series of sixteen possible dihalogenated
methanes which may be represented by a cSLN as: X2CH2X1{X1:F|Cl|Br|I}{X2:F|Cl|Br|I}.
If bromobenzene were the "active lead" and the dihalomethanes were the
series to be considered, an appropriate metric that indicated the lack of similarity of the
aromatic core of bromobenzene to the methylene core of the dihalomethanes would
immediately eliminate all dihalomethanes without considering each of the sixteen individual
possibilities. However, if ethyl bromide were the "active lead", an appropriate metric might
show that the methylene and ethylene moieties were sufficiently similar to warrant
consideration of the individual methylene dihalides, and preordering of the variation list might
immediately lead to dibromomethane as the most similar dihalomethane to ethyl bromide (the
first bromine atom being identical to the ethyl bromide bromine, and the second bromine atom
probably being the most similar to the CH3 of the ethyl bromide). In this hypothetical example
only one molecule instead of sixteen would need to be considered in identifying similar
molecules most likely to lie within the same neighborhood as the lead. Within actual cSLNs
(each possibly representing perhaps millions of structures by including more points of variation
and many more and larger variations at each point), the speed enhancement obtainable from
this searching strategy would be many orders of magnitude greater than sixteen.
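
The core-first, preordered search described above can be illustrated with a minimal Python
sketch. It is not the implementation of this disclosure: the function name search_series, the
toy distance table, and all numeric values are hypothetical, chosen only so that the
dihalomethane example reproduces the outcome described in the text.

```python
from typing import Callable, List

# Hypothetical toy distances (0 = identical, 1 = unrelated). These are NOT
# measured metric values; they only mimic the dihalomethane example.
TOY_DISTANCES = {
    frozenset(("ethyl", "CH2")): 0.2,
    frozenset(("benzene", "CH2")): 0.9,
    frozenset(("Br", "Cl")): 0.4,
    frozenset(("Br", "I")): 0.45,
    frozenset(("Br", "F")): 0.7,
    frozenset(("CH3", "Br")): 0.25,
    frozenset(("CH3", "Cl")): 0.4,
    frozenset(("CH3", "I")): 0.5,
    frozenset(("CH3", "F")): 0.6,
}


def toy_distance(a: str, b: str) -> float:
    """Look up a symmetric toy distance; unknown pairs count as unrelated."""
    if a == b:
        return 0.0
    return TOY_DISTANCES.get(frozenset((a, b)), 1.0)


def search_series(
    lead_core: str,
    lead_side_chains: List[str],
    series_core: str,
    variation_lists: List[List[str]],
    distance: Callable[[str, str], float],
    radius: float,
) -> List[List[str]]:
    """For each variation site, return the candidates within `radius` of the
    corresponding lead side chain; return [] if the cores are too dissimilar."""
    # Core check: a dissimilar core excludes the whole series at once.
    if distance(lead_core, series_core) > radius:
        return []
    hits = []
    for lead_chain, candidates in zip(lead_side_chains, variation_lists):
        # Preorder the variation list by similarity to the lead side chain.
        ordered = sorted(candidates, key=lambda c: distance(lead_chain, c))
        kept = []
        for candidate in ordered:
            if distance(lead_chain, candidate) > radius:
                break  # every later candidate is at least as far away
            kept.append(candidate)
        hits.append(kept)
    return hits


HALOGENS = ["F", "Cl", "Br", "I"]
# Bromobenzene as lead: the aromatic core rejects all sixteen dihalomethanes.
print(search_series("benzene", ["Br"], "CH2", [HALOGENS, HALOGENS], toy_distance, 0.3))
# Ethyl bromide as lead: only dibromomethane survives.
print(search_series("ethyl", ["Br", "CH3"], "CH2", [HALOGENS, HALOGENS], toy_distance, 0.3))
```

Because each variation list is sorted before it is scanned, the inner loop can stop at the first
candidate outside the radius, which is the source of the speed enhancement described above.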

There may be other variations of the applications of the methods outlined above which
are not yet recognized at the present time since the concepts and applications of this invention
are still so new. However, reasonable extrapolations/techniques of molecular discovery which
follow from the disclosure of the present invention and, in particular, from the ability to
validate metrics, are considered within the teaching of this application.
10. Merging Libraries 

The final selection (sampling) methodology of this invention has broader uses than yet
described. So far, this disclosure has been primarily concerned with the design of a
combinatorial screening library based upon either sets of reactants or sets of reactants and
central cores. Each combinatorial screening library based on these materials only explores the
diversity of that part of the chemical universe accessible with those compounds. Unless as
much of the diversity of the entire combinatorially accessible chemical universe is explored
in a screening library as is possible, there is no assurance that a molecule possessing activity
with respect to any particular unknown biological assay will be found. Clearly, the useful
diversity of the combinatorially accessible chemical universe can only be explored with as
many sets of reactants attached to as many cores as is possible. Stated slightly differently,
there may be large parts of the diversity of the chemical universe not explored by one or even
a few combinatorial schemes. Thus, combinatorial screening libraries based on multiple
reactants and multiple cores would be desirable. Just such libraries can now be created through
the use of the virtual library discussed later. However, even with screening libraries
constructed with the method of this invention discussed above, the simple addition to each
other of many such libraries will quickly increase the total number of molecules which need
to be screened. Worse yet, since many of the possible reactants used for combinatorial
synthesis with different cores have similar structures, and since many of the possible cores
used for combinatorial synthesis may differ little from each other, it is highly likely that much
of the same diversity is represented to a greater or lesser extent in each of the libraries
generated from these materials. Simply combining the libraries would again result in
oversampling of the same diversity space. It would clearly be more useful and economical
(efficient) in terms of time, money, and opportunity to use additional screening to explore
different aspects of the diversity of the chemical universe.

Another significant feature of this invention is the recognition that the neighborhood
selection (sampling) criterion also provides a method to combine combinatorial screening
libraries to avoid this oversampling problem. Starting with an arbitrary first library, using a
validated metric which can be applied to whole molecules, each molecule of a second library
is added to the first library if the molecule does not fall within the neighborhood radius of any
molecule in the first library as supplemented by all the added molecules from the second
library. This process is continued until all the molecules in the second library have been
examined. In this manner, only molecules representative of a different aspect of diversity are
added from the second library to the first. Each successive library is added in the same
manner. The molecules in a final combined library formed from smaller libraries selected
according to the method of this invention represent diverse molecular compounds and have the
optimal diversity which is desired of a general combinatorial screening library. However, even
if the groups of molecules to be merged have not been selected by the methods of this
invention, they may be merged according to the above procedure if first, a subset of each
group of molecules is selected according to the product sampling method of the design process.
This will ensure that similar molecules within each group are eliminated. The resulting merged
library will not be optimally diverse, but it should not redundantly sample the diversity present
in the separate groups.
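
The merging procedure just described can be sketched in a few lines of Python. This is an
illustrative sketch only: the function name merge_libraries is hypothetical, and the example
stands in for molecules with plain numbers and for a validated whole-molecule metric with an
absolute difference, purely to make the selection rule visible.

```python
from typing import Callable, Iterable, List, TypeVar

Molecule = TypeVar("Molecule")


def merge_libraries(
    first: Iterable[Molecule],
    second: Iterable[Molecule],
    distance: Callable[[Molecule, Molecule], float],
    radius: float,
) -> List[Molecule]:
    """Add each molecule of `second` to `first` unless it lies within
    `radius` of any molecule already in the growing merged library."""
    merged = list(first)
    for candidate in second:
        # Compare against the first library as supplemented by every
        # molecule already accepted from the second library.
        if all(distance(candidate, kept) > radius for kept in merged):
            merged.append(candidate)
    return merged


# Toy stand-in (an assumption): molecules are points on a line and the
# metric is their absolute difference.
library_a = [0.0, 1.0]
library_b = [0.1, 2.0, 2.05, 3.0]
combined = merge_libraries(library_a, library_b, lambda x, y: abs(x - y), 0.15)
print(combined)  # [0.0, 1.0, 2.0, 3.0]
```

Successive libraries can be folded in by calling the same function again with the combined
library as the first argument.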

The 2D Tanimoto fingerprint metric is useful in performing the library additions. The
2D Tanimoto similarity coefficients of each molecule in the first library to all molecules in a
subsequent library are calculated. Each molecule of the second library is added to the first
library if the molecule does not fall within a 0.85 Tanimoto coefficient (the neighborhood
radius) of any molecule in the first library as supplemented by all the added molecules from
the second library. As long as the metric used for sampling and end-point determination is
valid (has the neighborhood property), this selection method guarantees a combined library in
which all of the accessible diversity space is represented with little likelihood of oversampling.
An example of three prior art libraries not designed with the method of this invention which
might be merged using the neighborhood sampling criterion is shown in Figure 15. Figure 15
shows the distribution of molecules in the Chapman & Hall Dictionary of Natural Products,
Dictionary of Pharmacological Agents, and Dictionary of Organic Compounds (CD ROM Versions),
plotted according to their Tanimoto 2D pairwise similarity. It is immediately clear from
Figure 15 that simply adding the three libraries together would produce a combined library in
which most of the compounds would be very similar to each other (Tanimoto similarities
>0.85). Further redundant similarity would be expected from a comparison of the similarities
between the molecules in the three libraries. The position of the 0.85 similarity point relative
to the bulk of the molecules in each library indicates that most of the molecules in these
databases would be excluded from a combined library formed by merging the databases by the
procedure outlined above.
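
For readers unfamiliar with the Tanimoto coefficient, the following Python sketch shows how it
is commonly computed for two fingerprints, each represented here as the set of its "on" bit
positions. The representation, the function names, and the treatment of a coefficient of
exactly 0.85 as falling inside the neighborhood are assumptions made for illustration; the
fingerprints shown are made up and do not correspond to real molecules.

```python
from typing import AbstractSet


def tanimoto(a: AbstractSet[int], b: AbstractSet[int]) -> float:
    """Tanimoto coefficient: shared on-bits divided by total distinct on-bits."""
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


def is_neighbor(a: AbstractSet[int], b: AbstractSet[int], threshold: float = 0.85) -> bool:
    """True when two fingerprints are similar enough to be redundant."""
    return tanimoto(a, b) >= threshold


# Made-up fingerprints (sets of on-bit indices).
fp_x = set(range(20))
fp_y = set(range(18))      # shares 18 of 20 distinct bits with fp_x
fp_z = set(range(10, 30))  # shares 10 of 30 distinct bits with fp_x

print(is_neighbor(fp_x, fp_y))  # True: the second molecule would be skipped
print(is_neighbor(fp_x, fp_z))  # False: the second molecule would be added
```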

11. Other Advantages of Optimally Diverse Libraries

There are additional benefits achieved by designing combinatorial libraries according
to the method of this invention. For instance, as noted earlier, one of the difficulties of
screening several compounds simultaneously is the possibility of non-specific activity being
detected due to the contributory effect of the combination of compounds. In fact, the likelihood
of this effect is increased when compounds of the same molecular structural and chemical
diversity are tested in the same assay. With the libraries of this invention, it will be possible
to design the assay combinations so that only compounds representing different aspects of
diversity are tested together. While this procedure cannot guarantee that no combination
effects will occur, it makes them much less likely. Another benefit achieved is that complex
deconvolutions will generally be unnecessary. Deconvolution problems are accepted in the
prior art as a necessary evil due to the enormous number of molecules which must be
synthesized and screened since virtually all combinatorial possibilities are included in the
libraries. Clearly, with smaller optimally diverse combinatorial screening libraries covering
the same search territory as the larger prior art libraries, it is possible with the aid of computer
controlled robots and data bases to individually synthesize and track each compound.

As mentioned at the beginning of this disclosure, the methods of this invention are also
applicable to problems outside the specific area of drug research. The notion of choosing
compounds based on diversity is a general concept with many applications and is applicable
any time the problem is presented of having more compounds than can usefully be tested/used.
The example was given earlier of determining what compounds had the same structural
diversity as a previously identified (biologically active) compound. Of course, with the
methods of this invention, the activity may be any chemical activity. In addition, the universe
of chemicals from which only some are to be selected does not have to result from a
combinatorial synthesis, but may result from any synthesis or no synthesis at all. An example
of the latter would be the solution to the question of selecting molecules of similar diversity
from among those in a large corporate or catalog data base. In these cases, an appropriate
metric (remembering that different metrics are applicable in different circumstances) would be
applied to all the compounds and clustering would result in compounds of the same diversity.
The methods of this invention, including metric validation, topomeric CoMFA metric
characterization, end-point neighborhood sampling, lead compound optimization, and library
design can all be applied separately and together to solve the selection problem.
12. Virtual Library Construction & Searching
The two step sequential design process for selecting optimally diverse product molecule
libraries set out so far in this application is necessarily computationally time consuming,
limited to consideration of one set of synthetic reactions at a time, and eliminates at the first
stage reactants which might be capable of generating products which would pass the product
stage neighborhood filtering criteria. The process is computationally time consuming since, for
any given set of reactants, the steric metric must first be computed, the resulting descriptors
clustered, and a selection of reactants made based on the neighborhood rule. Only after this
first stage can the possible product molecules be determined, a second product metric
calculated, and selection made of the final library members.

The process is limited to one set of synthetic reactions at a time in the following sense.
First, a particular organic chemical reaction scheme is identified as well as the core and
possible reactants which may be used in the scheme. Each sequential step of library design is
sequentially implemented and results in an optimally diverse library for that reaction. For a
slightly different core which involves the same chemical reaction scheme and the same
reactants, the entire process including all calculations must be repeated. Each combination of
core and reactants generates a different library. In the method of the above referenced patent
application, the resulting libraries, individually derived, are then combined. This process also
adds additional time to the assemblage of a larger optimally diverse library. Finally, the
product stage of the design is constrained by the reactant stage; that is, since it is desirable to
generate as many diverse products as possible, some products may be sufficiently diverse (as
confirmed by the product neighborhood metric) when created from similar reactants (those
falling within a topomeric neighborhood cluster) by virtue of the mere combination of the
reactants into the products, and such products should be included in the library.

In addition, consideration of the above techniques of optimally diverse library design,
lead optimization, and merging libraries all point to the distinct advantages of being able to
explore the diversity of combinatorially accessible chemical universes using/including as many
reactions, cores, and reactants as possible. Thus, it was recognized that, ideally, library design
and lead optimization would be most useful if all combinatorially accessible molecules could
be meaningfully searched. The sheer number of molecules involved (trillions) would seem to
suggest that even with today's fastest computers, such a library design and searching would
be unachievable. However, using the power and utility of validated metrics, a way to create
and search a data base containing representations of products from as many combinatorial
reactions and reactants as desired (a huge combinatorially accessible universe) has been
discovered. This data base is essentially a virtual library of combinatorial products because,
as will be explained below, all information necessary and sufficient to search across and
construct all possible product molecules is contained within the virtual library even though the
structure of each combinatorial product is not explicitly contained within the virtual library.

The virtual library can be used not only to select screening libraries, to find molecules
with similar structures to a lead compound, to perform lead explosions, but, through the use
of validated metrics, it can also be used to search for and select compounds likely to have
similar biological or other physical properties from across the broader chemical universe. In
fact, as will be seen below, use of the virtual library opens up possibilities for searching the
accessible chemical universe in ways not heretofore possible.

With respect to the selection of screening libraries, it has been discovered that the same
approach to design as previously described can be performed more efficiently and more exactly
by combining the formerly separate steps of topomeric selection of reagents and Tanimoto
selection of products into one step which operates on the entire set of all possible products
from the reaction under consideration. Another advantage of this approach is that generally a
larger group of diverse compounds is identified; that is, the significant (active) metric space
is sampled more extensively. Additionally, the method by which the maximally diverse set is
selected can be modified to yield results which more readily suit the practical issues of
laboratory synthesis. As a consequence of this discovery, an efficient method for identifying
molecules of interest from the billions of possible products obtainable from combinatorial
syntheses has been discovered. Indeed, use of the virtual library is not limited to finding
molecules derivable from known synthetic combinatorial reactions, but is generally applicable
to molecular selection. As with the selection methodology discussed above, the ability to create
and search the virtual library relies upon the power of the neighborhood property of validated
metrics to distinguish the similarity or dissimilarity of molecular properties between molecules.

The creation of a virtual library using validated molecular descriptors enables methods
to identify compounds of interest from many possible compounds and is particularly applicable
to identifying compounds of interest from extraordinarily large numbers of compounds. The
application of these novel methods speeds the searching operation and in some ways extends
the types of searching criteria which may be used. Most importantly, construction of a virtual
library makes it possible to identify compounds of interest by an exhaustive search through all
possible compounds from a series of known synthetic reactions - thus providing a capability
which does not currently exist otherwise. In particular, the virtual library provides a large
number and variety of ways to select a subset of compounds from a very large number of
compounds. The number of compounds from which to make the selection is likely to range in
the trillions of compounds, based only on known synthetic reactions and commercially
available reagents appropriate for each reaction.

The following disclosure of the method of constructing and searching a virtual library
will be discussed with respect to those compounds accessible through combinatorial syntheses.
However, as noted above, the virtual library is not limited to such combinatorial compound
universes and these universes are disclosed by way of an example of the methodology of the
discovery, not a limitation thereof.

The significant aspect of being able to create a virtual library using validated metrics
is the ability to identify from the large universe of compounds those with related properties
and/or structural characteristics without having to examine individual structures; in other
words, to do structural searches without directly comparing (looking at) structures. This is
made possible by precalculating, as much as possible, characteristics for the component parts
of the product structures. Clearly, then, the beginning point for this method is the construction
of a database, or "virtual library", of possible chemical compounds, products, which can be
synthesized from a common reaction.

A. Derivation of the Database (Virtual Library) of Compounds

The database of compounds, "virtual library", to which the method of this invention
may be applied is an assembly of the combinatorially derived product structures resulting from
any number of synthetic reactions. In initial applications tens of reactions are used to construct
the database (virtual library) of interest. The total number of possible product compounds
becomes astronomically large very quickly. For instance, there are approximately 500
commercially available molecules having reactive diamino groups and approximately 15,000
commercially available reactants which will react independently with each of the amino groups.
Combinatorially there can therefore be generated 15,000 x 15,000 x 500 (112 billion) possible
product molecules from this one reaction scheme alone.
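
The arithmetic of the diamine example can be checked with a short Python sketch. The function
names are hypothetical. The second function counts the separate starting structures (cores plus
reagents at each site) and is included only to contrast the size of the product space with the
number of pieces from which it is built.

```python
from math import prod
from typing import Sequence


def count_products(n_cores: int, site_sizes: Sequence[int]) -> int:
    """Number of products: one core choice times one choice at every site."""
    return n_cores * prod(site_sizes)


def count_building_blocks(n_cores: int, site_sizes: Sequence[int]) -> int:
    """Number of distinct starting structures that must be described."""
    return n_cores + sum(site_sizes)


N_DIAMINES = 500      # approximate count given in the text
N_REAGENTS = 15_000   # approximate count given in the text, per amino group

total = count_products(N_DIAMINES, [N_REAGENTS, N_REAGENTS])
blocks = count_building_blocks(N_DIAMINES, [N_REAGENTS, N_REAGENTS])
print(f"{total:,} products from {blocks:,} building blocks")
# prints: 112,500,000,000 products from 30,500 building blocks
```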

B. Overview of Methodology 

A fundamental part of the discovery of how to create and use a virtual library is a
method to precompute properties based on $1 + N_1 + N_2 + \dots + N_M$ structural variations
which can be used to exactly, or with a useful degree of approximation, predict the
$1 \times N_1 \times N_2 \times \dots \times N_M$ product structure properties which arise from
all combinations of the structural variations about the 1 core at all M substitution sites. In
the earlier part of this disclosure, the variable parts of a combinatorially derived molecule
were referred to either by reference to their source (reactants) or their molecular configuration
when attached to the core (side chains). When discussing creation and searching of a virtual
library, the more generic term "structural variations" is appropriate for the groups appended
to a core. The reasons for adopting this term will become clear later during the discussion of
searching the virtual library with respect to non-combinatorially derived structures.

Figure 16 shows in schematic form a representation of three structural variations
attached to a central core. In Figure 16, each possible product structure arises from combining
the core substructure with exactly one of the $N_1$ choices in the set of structural variations
$\{R_1\}$, exactly one of the $N_2$ structural variations in the set $\{R_2\}$, etc.

For many properties, such as molecular weight and price, or count of rotatable bonds,
or number of H-bond donors and acceptors, the values associated with the product compound
are exactly the sum of the values for the appropriately created structural variations.
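
A minimal Python sketch of this additivity follows. It assumes, as the phrase "appropriately
created" suggests, that the value stored for each core and structural variation has already
been adjusted for atoms gained or lost on bond formation, so that simple summation gives the
product value. The class and function names and all numeric values are hypothetical.

```python
from dataclasses import dataclass
from itertools import product
from typing import Iterator, Sequence


@dataclass(frozen=True)
class Part:
    """A core or structural variation with precomputed additive properties."""
    name: str
    mol_weight: float
    rotatable_bonds: int
    h_bond_donors: int


def combine(core: Part, variations: Sequence[Part]) -> Part:
    """Describe one product by summing the stored values of its parts."""
    parts = [core, *variations]
    return Part(
        name="-".join(p.name for p in parts),
        mol_weight=sum(p.mol_weight for p in parts),
        rotatable_bonds=sum(p.rotatable_bonds for p in parts),
        h_bond_donors=sum(p.h_bond_donors for p in parts),
    )


def enumerate_products(core: Part, sites: Sequence[Sequence[Part]]) -> Iterator[Part]:
    """Yield every product: exactly one structural variation per site."""
    for choice in product(*sites):
        yield combine(core, choice)


# Made-up values for illustration only.
core = Part("core", 50.0, 0, 0)
site_1 = [Part("a", 15.0, 1, 0), Part("b", 17.0, 0, 1)]
site_2 = [Part("c", 31.0, 1, 0)]

products = list(enumerate_products(core, [site_1, site_2]))
for p in products:
    print(p.name, p.mol_weight, p.rotatable_bonds, p.h_bond_donors)
```

Only 1 + 2 + 1 parts carry stored values here, yet both products are fully described without
any whole-molecule calculation.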

For some properties, such as logP, the assumption of additivity is inexact but adequate
for the purpose of selecting a small subset from a very large number of possible products.
For other properties, particularly the topomeric shape descriptor, the comparison of two
product compounds' properties requires a decision on how to match each structural variation's
descriptor in the first product to one structural variation's descriptor in the second product
such that each structural variation is referenced exactly once.
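
One possible way to make that matching decision is sketched below in Python. It is an
assumption, not the rule of this disclosure: every one-to-one pairing of structural variations
is tried, per-site differences are combined as a root sum of squares, and the smallest combined
value is kept. Descriptors are stood in for by short numeric tuples, and the function names are
hypothetical.

```python
from itertools import permutations
from math import inf, sqrt
from typing import Sequence, Tuple

Descriptor = Tuple[float, ...]


def descriptor_distance(a: Descriptor, b: Descriptor) -> float:
    """Euclidean difference between two per-variation descriptors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def product_distance(p: Sequence[Descriptor], q: Sequence[Descriptor]) -> float:
    """Smallest combined difference over all one-to-one matchings, so that
    each structural variation of either product is referenced exactly once."""
    if len(p) != len(q):
        raise ValueError("products must have the same number of variation sites")
    best = inf
    for order in permutations(range(len(q))):
        combined = sqrt(
            sum(descriptor_distance(p[i], q[j]) ** 2 for i, j in enumerate(order))
        )
        best = min(best, combined)
    return best


# Same two variations listed in opposite site order: distance is zero.
print(product_distance([(0.0, 0.0), (1.0, 1.0)], [(1.0, 1.0), (0.0, 0.0)]))  # 0.0
# One variation differs by one unit in one coordinate: distance is one.
print(product_distance([(0.0, 0.0), (1.0, 1.0)], [(0.0, 0.0), (1.0, 2.0)]))  # 1.0
```

Trying every permutation is only practical for the small number of substitution sites typical
of a combinatorial scheme.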

There are also some properties (such as molecular fingerprints) which are representative
of the whole combinatorial product molecule and cannot be represented by the sum of the
constituent structural variations. The method for deriving these properties will be discussed
below. Generally, however, by this method a virtual library containing descriptions of the
structures of all possible combinatorially generated products can be created from a knowledge
of the properties of the structural variations.

C. Overview of Virtual Library Construction

Initially, information on the reactions to be included and the reagents which may be used
with those reactions needs to be gathered and entered. In addition, the reagents need to be
converted to their corresponding structural variations. The overall process of virtual library
construction is summarized in the flowchart of Figure 17. The first step in the creation of the
virtual library is to create for each possible structural variation (variable part) a file containing
various parameters/characteristics associated with that structural variation. Typically the file
may contain information on the price, source, availability, MW, and logP. In addition, the
metric characteristics for the structural variation resulting from the application of validated
metrics to the structural variation structure are included in the file. Other characteristics which
might be used for searching may be added to the file. Similar files are created for core
structures. As with the earlier discussion of designing optimally diverse libraries, any validated
metric may be chosen to characterize the structural variations or cores. For purposes of
discussion of the virtual library, the same metrics, topomeric CoMFA and Tanimoto
fingerprints, will be used as in the examples earlier.

The second step in creation of the virtual library is a description of the chemical
transformation represented by the chosen chemistry. The virtual library is then created by
combinatorially combining all structural variations in the chemical transformation to generate
virtual library descriptions of all possible product molecules.

Substantial effort is required to produce the representation of the structural variations 
forming the database from a given reaction. The software provided as Appendix "E" and 
Appendix "F" to this application is used in conjunction with the commercial software products,
Selector and Legion, to compute properties of the structural variations and to combine two or
more such lists of structural variations along with a core structure to produce the representation 
of all possible products. 

Particular skill is required to convert the chemist's description of reaction conditions
and reaction validation into a set of selection criteria applied to a database of available
reagents, by which only those reagents which are actually likely to yield the desired product
in the specific reaction conditions are included. (Here "reagents" refers to chemical starting
materials which undergo reaction to produce the products. A reagent corresponds to a molecule
used in a structural variation in the method, after some rearrangement of bonds.) Additionally,
methods for automating chemical judgment to derive the list of reagents and to compute the
properties such as the topomeric shape descriptor have been developed. Finally, a key concept
in constructing the virtual library is to organize the process of library definition so that it
depends on a relatively small number of parameters which can be stored in a table so that each
row in the table defines all the information that is necessary to specify a combinatorial library.

While the following discussion addresses formation of the virtual library in terms of chemical
transformations, cores, and reagents and/or structural variations which may be used, it should
be appreciated that data in the virtual library may be generated by any cores and structural
variations as long as the resulting compounds can be described by a cSLN. Thus, even product
molecules which cannot be synthesized by a known combinatorial reaction can be included
in the virtual library and their structures searched.

D. Virtual Library Construction

The first phase of construction of a combinatorial library to be included in the virtual library
takes as input a description of the chemical transformation represented by that combinatorial
library and a list of available reagents and produces as output all the part structures (a/k/a
structural variations) found in the list of available reagents which are appropriate for the
chemical transformation, along with all structure-invariant physicochemical properties of those
fragments that might be useful in different types of subclass (subset) searches. As is apparent
from the earlier discussion, the same general and biologically based elimination criteria can
be applied to the proposed structural variations before selection of the structural variations for
inclusion in the virtual library. Alternatively, structural variations which would be eliminated
by the general or biologically based criteria can be flagged but still included. Having the
structural variations flagged, few potential product structures are eliminated from the virtual
library, but the products containing particular types of undesirable structural variations can still
be removed during selection.

In the course of this process, data are entered and recorded permanently into three 

tables: 

REACTIONS (a Molecular Spreadsheet) = information about a reaction scheme. Each
record corresponds to a reaction. A typical reaction would be: "reaction
of each nitrogen of a diamine with various reagents such as acids
(acylation) or ketones (reductive amination)".
REAGENTS (a Molecular Spreadsheet) = information about a particular set of
reagents used in some instance of a reaction. Each record corresponds
to a particular logical reagent structure search in a database of such
reagents, presumably a set of reagent structures which will all react in
the same way. For example, there are sixteen reagent records for the
diamine reaction, enumerating each of eight reactant classes that might
react with each of the two nitrogens. One record, for example, describes
a reaction with epoxides, which could be ring opened nucleophilically
(and regioselectively) by an amine to yield a beta-amino alcohol.
RDATA (an Oracle Table) = invariant physicochemical data computed about
structural variations, typically the varying portions in a cSLN, with one
record for each structural variation encountered in any cSLN
constructed. Thus data need not be recomputed when such structural
variations are reencountered, a substantial savings in processing time.
For example, records will be added describing the properties of a
-CH2CH(OH)R chain (structural variation) for each (new) epoxide-R
reagent retrieved by the example record just given for the REAGENTS
spreadsheet.
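
The role of RDATA as a store of data that is computed once and then reused can be
illustrated with a minimal Python sketch. The class name RDataCache, the dictionary
storage and the toy property function are assumptions made for illustration only; the
actual table is an Oracle table populated by getacd.core.

```python
class RDataCache:
    """Hypothetical stand-in for RDATA: one record per structural variation."""

    def __init__(self, compute_properties):
        self._compute = compute_properties  # expensive property calculation
        self._records = {}                  # variation SLN -> stored record
        self.computations = 0               # how often the calculation ran

    def lookup(self, variation):
        # Compute only on first encounter; reuse the stored record afterwards.
        if variation not in self._records:
            self._records[variation] = self._compute(variation)
            self.computations += 1
        return self._records[variation]


# Toy property function (assumption): a real system would compute MW, LOGP, etc.
cache = RDataCache(lambda sln: {"sln_length": len(sln)})
cache.lookup("CH2CH(OH)CH3")
cache.lookup("CH2CH(OH)CH3")  # reencountered: no recomputation
print(cache.computations)
```

The saving grows with the number of cSLNs in which the same structural variation
recurs, since each recurrence costs only a lookup.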

Entering a new reaction into the system involves inputting the data for a new row to
REACTIONS and at least two new rows to REAGENTS. This data entry operation is the only
required data entry in preparation for virtual library production.

All these operations of table preparation are carried out by the SPL script getacd.core
(Appendix E) and executed within the commercially available software product SYBYL. The
code for producing the topomeric CoMFA field descriptor of each structural variation is
provided as Appendix F, CTOPS.
i. Representation of the Database of Compounds

The virtual library database of compounds for any one synthetic reaction is represented
as a set of chemically bonded (connected) structural variations where the connecting elements
may consist of a common core (one or more atoms which are identified in all members of the
set). More than two variable sites may be involved. The list of structural alternatives therefore
contains two or more elements, each of which represents a specific molecular fragment and
a number of associated molecular properties. Table 6 and Table 7 below are produced by
getacd.core. For each combinatorial scheme a set of files is generated. For a di-substitution
scheme the first file defines the combinatorial scheme, and the second and third files describe
the structural variations which can be utilized at the two sites. For a tri-substituted scheme,
there will be a set of four files: the first defining file, and three additional files describing the
structural variations for each of the three sites. The number of files in each set of files is
clearly determined by the combinatorial scheme involved.

In Table 6, the information following #@CORE describes the core, the information
following #@CONNECTOR describes the location of attachment of each of the two varying
sites, and the #@QUERY line shows an example of how the list of structural variations may
be specified. Essentially this QUERY describes how to combinatorially construct product
molecules out of the structural variations and is used after searching of the data base is
complete to generate actual product structures.

TABLE 6

Sample cSLN File

#SYBYL/3DB HITLIST
# Created: Date Time
#@CLASS STRUST
#@DATABASE NONE
#@SOURCE VDB_BUILDER
#@SUPPLIER
#@PRICE
#@FCD
#@MW 85.062
#@LOGP -1.05
#@CORE X1C(=O)CH2NHC(=O)X2
#@CONNECTOR 1,X1=2;11,X2=9
#@QUERY
Y_01C(=O)CH2NHC(=O)Y_02{Y_02:FC(F)(F)C[5]:C:C(:CH:CH:CH:@5)C(F)(F)F<V=6>}\
{Y_01:FC(F)(F)C[5]:CH:C(:CH:C(:CH:@5)OCH3)NH<V=19>}
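
As a hedged illustration of how a definition file laid out like Table 6 might be read,
the following Python sketch collects the "#@KEY value" header lines into a dictionary.
The function name parse_csln_header and the decision to skip every line that does not
begin with "#@" are assumptions made for illustration; they are not taken from
getacd.core.

```python
def parse_csln_header(text):
    """Collect '#@KEY value' lines of a Table 6 style file into a dict.

    Hypothetical helper: keys with no value (e.g. '#@PRICE') map to ''.
    """
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("#@"):
            continue  # plain comments, blank lines and the query body
        key, _, value = line[2:].partition(" ")
        fields[key] = value.strip()
    return fields


sample = """#SYBYL/3DB HITLIST
#@DATABASE NONE
#@SOURCE VDB_BUILDER
#@PRICE
#@MW 85.062
"""
header = parse_csln_header(sample)
print(header["SOURCE"], float(header["MW"]))
```

Keeping the values as strings leaves the interpretation of each field (numeric, SLN,
connector list) to the caller.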
ii. Application of A First Metric (Topomeric CoMFA)
Table 7 shows the format in which the structural variations for the first variable site
are listed, including both the structure in Sybyl Line Notation (SLN) and a set of related
properties such as SUPPLIER, PRICE, molecular weight MW, estimate of hydrophobicity
LOGP, and a field, CTOPS, which in encoded form represents the novel shape descriptor, the
topomeric field (the steric field of the topomeric conformation) for the corresponding structural
variation. Information on only two possible structural variations is shown. For the diamino
example above, this structural variation file would contain all of the structural variations which
react with an amino group, approximately 15,000 entries.



TABLE 7

Structural Variations At First Site

FC(F)(F)C[5]:CH:C(:CH:C(:CH:@5)OCH3)NHR1<FCD=TRIPOS_0393;PRICE=101.4
;SUPPLIER=ALDRICH;MW=190.14;LOGP=2.33;CTOPS=[encoded topomeric steric field string]>

FC(F)(F)C[5]:CH:C(:CH:C(:CH:@5)C(F)(F)F)NHR1<FCD=TRIPOS_0394;PRICE=14.84
;SUPPLIER=ALDRICH;MW=228.12;LOGP=3.32;CTOPS=[encoded topomeric steric field string]>




A second file similar in appearance to that of Table 7 which lists all the structural 
variations which may occur at the second site is also created. 
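
A record of a structural variation file consists of an SLN followed by a
semicolon-separated property list in angle brackets. A minimal Python sketch of splitting
such a record is given below. The function name parse_variation is hypothetical, and the
sketch assumes that the first "<" in the record opens the property list and that line
wraps inside a record carry no meaning.

```python
def parse_variation(record):
    """Split 'SLN<key=value;key=value>' into (sln, properties)."""
    record = "".join(record.split())     # drop line wraps and stray spaces
    sln, _, tail = record.partition("<")  # assumption: first '<' opens the list
    props = {}
    for item in tail.rstrip(">").split(";"):
        if item:
            key, _, value = item.partition("=")
            props[key] = value
    return sln, props


record = """FC(F)(F)C[5]:CH:C(:CH:C(:CH:@5)OCH3)NHR1<FCD=TRIPOS_0393;PRICE=101.4
;SUPPLIER=ALDRICH;MW=190.14;LOGP=2.33>"""
sln, props = parse_variation(record)
print(sln, props["SUPPLIER"], float(props["MW"]))
```

Any further field added to the record, such as the result of another validated metric,
would be picked up by the same loop without change.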

iii. Application of A Second Metric (Tanimoto Fingerprint)
The overall process of applying the Tanimoto fingerprint metric for use in the
virtual library is summarized in the flowchart of Figures 18, 19, and 20. As mentioned
above, certain properties (molecular descriptors) of the product molecules can not be
simply computed as the sum of the associated properties of the substructures used to form
the product molecule. One of the most important and challenging to compute of these
molecular descriptors is the molecular fingerprint. This product descriptor can not be
calculated as the simple additive result of the descriptors of its pieces. For fingerprints, any
fragment which is not fully contained within the core alone or within one structural
variation alone will not be represented by treating each piece separately. Therefore, a
fingerprint descriptor is computed for an extended core consisting of the structural variation
at site Ri and including the substructures which consist of:

1) the structural variation;

2) the common core substructure; and

3) all invariant atoms contiguously connected to the core occurring in structural
variations at sites other than Ri.

This process is repeated for all sites.

Thus, in Figure 16, if each selection in {R2} includes an OCH2 group connected to
the core and each selection in {R3} contains a CH connected to the core, the fingerprints
corresponding to a selection from {R1} will describe the substructure formed by this
selection connected to the core and also including an OCH2 connected to the core at site 2
and a CH connected to the core at site 3.

For the standard definition of 2D fingerprints, this method can yield an exact result
of the product fingerprint whenever the shortest connected path through the extended core
is 5 atoms or more by OR-ing (a Boolean algebra manipulation) the fingerprints of each of
the 3 structural variations in the example above. There is no need to include a separate
fingerprint for the core, since it is contained in all the structural alternative descriptors.
There is no hazard of duplication, since a fingerprint with a few exceptions notes only the
presence of a connected fragment, not the number of occurrences. That is, either a bit is
set in the fingerprint for that structure or it is not set. Duplicate occurrences of the same
structure can not set the bit twice. In the few cases, such as ring and halogen structural
features, where a count is maintained, correction for these bits of the fingerprint may be
accomplished by explicit correction by count of structural variations plus core.
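
The OR-ing step can be sketched in Python as follows. Representing each fingerprint as
an integer whose binary digits are the fingerprint bits is an assumption made for
illustration, as are the function names; the count-based bits (rings, halogens) mentioned
above are not handled in this sketch.

```python
def popcount(bits):
    """Number of bits set in a fingerprint held as a non-negative int."""
    return bin(bits).count("1")


def estimate_product_fingerprint(site_fingerprints):
    """OR together the extended-core fingerprints of the chosen variations."""
    product_fp = 0
    for site_fp in site_fingerprints:
        product_fp |= site_fp  # a bit set by several pieces is still set once
    return product_fp


# Both pieces set bit 1 (e.g. a core fragment); it appears once in the result.
fp_site1 = 0b1011
fp_site2 = 0b0110
estimate = estimate_product_fingerprint([fp_site1, fp_site2])
print(bin(estimate), popcount(estimate))
```

Because OR is idempotent, the core bits carried by every extended-core fingerprint need
no separate handling.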

In some cases the extended core is not large enough to assure exact construction of
the product fingerprint from that of the pieces (i.e., some relevant fragments start in one
structural variation, span the extended core and reach into the individual alternatives at
another site). To create and explicitly fingerprint every compound is in fact possible for a
set of one million products. For the creation of a virtual library with initially tens of
millions of products and ultimately hundreds of millions and even hundreds of billions of
product compounds, explicit fingerprint computation is not feasible in any realistic time
frame. For this scale of virtual library creation an approximation is both acceptable and
necessary. Finally, since the purpose of the creation of the virtual library is to provide a
basis for searching for molecules matching some subset criteria, the approximation method
must ensure that such searches are reliable.

For the approximation, a random sample of a statistically significant fraction
(typically for a very large virtual library, 0.001) of the products is taken. Each sample
product is checked to see how many bits are in the product but not in the fingerprint
composed from the pieces. The largest observed difference value, MBITS, is maintained
for future calculations and is used to identify, for example, all products which might be
similar to a given structure in the extreme case in which all MBITS missing bits were in
fact those which would make every product most similar.
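
The sampling procedure can be sketched in Python under stated assumptions:
fingerprints are held as integers, and the two callables true_fp and estimated_fp, which
return the explicitly computed and the OR-composed fingerprint of a product, are
hypothetical placeholders for the real fingerprinting code.

```python
import random


def estimate_mbits(products, true_fp, estimated_fp, fraction=0.001, seed=0):
    """Largest number of bits present in a sampled product's true fingerprint
    but absent from the fingerprint composed from its pieces.

    `products` is a non-empty sequence; `fraction` is assumed to be <= 1.
    """
    rng = random.Random(seed)
    sample_size = min(len(products), max(1, int(len(products) * fraction)))
    mbits = 0
    for product in rng.sample(products, sample_size):
        missing = true_fp(product) & ~estimated_fp(product)
        mbits = max(mbits, bin(missing).count("1"))
    return mbits


# Toy data: four products, indexed 0..3, all of them sampled (fraction=1.0).
true_bits = [0b0111, 0b1111, 0b0011, 0b0001]
est_bits = [0b0011, 0b0001, 0b0011, 0b0001]
print(estimate_mbits(range(4), true_bits.__getitem__, est_bits.__getitem__,
                     fraction=1.0))
```

Since MBITS is the maximum seen in a sample, a product outside the sample could in
principle be missing more bits; the sample fraction controls that risk.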

The Tanimoto is defined as (#bits in common) / (#bits in either) for the similarity of
two compounds' fingerprints. In the case at hand, the estimated product fingerprint might
have as many as MBITS bits which are actually present in the product fingerprint but
missing from the estimate. In the worst case, every one of those bits would be in common
with the bits in the query compound's fingerprint. Since Tanimoto = (#bits in common) /
(#bits in either), in our worst case this is (apparent #bits in common + MBITS) / (#bits in
either), since every one of the MBITS bits is already represented in the #bits in either but
is not present in the apparent #bits in common (i.e. the bits in common based on the
estimated product fingerprint).

By adopting this approach, an upper bound is calculated on the largest possible
Tanimoto between two compounds. The actual product fingerprint cannot yield a higher
Tanimoto than this, and almost always yields some value between the apparent Tanimoto
and the upper bound. In some cases this estimates the largest possible Tanimoto to be
greater than the actual maximum of 1.0; it serves no purpose to correct for this!

An example may be useful. Details of the computations are provided in the attached
code, dbcslnprepro, but to illustrate the concept assume that what is desired is a subset of
compounds defined as those with a Tanimoto similarity of 0.80 or higher to a specified
reference compound. By the methods of this invention the fingerprints of every one of the
2000 structural variations at two sites (1000 each) have been precomputed. An estimate can
be made of the fingerprints of every one of the 1,000,000 possible products by OR-ing the
two sites' fingerprints for every selection of one from each site. For a specific possible
product the number of common bits is 78 and the "# of bits in either" is 100, so that the
apparent Tanimoto is 78/100 which is below the cutoff of 0.80 and the product would not
be selected. However, if the MBITS is 3, then the worst case could have 78+3=81 bits in
common out of 100 bits in either, and the largest possible Tanimoto would be 81/100
which is greater than the cutoff. If it is desired to err on the side of not missing any
possible products, this value would be accepted even though the apparent Tanimoto is too
small.
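
The arithmetic of this example can be restated as a short Python sketch. The function
names are chosen for illustration only and do not correspond to routines in
dbcslnprepro.

```python
def tanimoto(bits_in_common, bits_in_either):
    """Apparent Tanimoto: (#bits in common) / (#bits in either)."""
    return bits_in_common / bits_in_either


def tanimoto_upper_bound(bits_in_common, bits_in_either, mbits):
    """Worst case: all MBITS missing bits turn out to be in common."""
    return (bits_in_common + mbits) / bits_in_either


def might_meet_cutoff(bits_in_common, bits_in_either, mbits, cutoff):
    """Keep a product whenever its upper bound reaches the cutoff."""
    return tanimoto_upper_bound(bits_in_common, bits_in_either, mbits) >= cutoff


print(tanimoto(78, 100))                    # apparent value, below 0.80
print(tanimoto_upper_bound(78, 100, 3))     # worst-case value, above 0.80
print(might_meet_cutoff(78, 100, 3, 0.80))
```

As noted in the text, the bound may exceed 1.0 for products with few bits in either;
the sketch, like the method, leaves such values uncorrected.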

The results of the fingerprint calculations discussed above are added as two
additional fields to the structural variation files: fpcard and fp, which together represent the
two-dimensional fingerprint of the structural alternative and everything to which it is
connected in all of the resulting products; this additional structure being needed to more
fully represent the fingerprint of a product compound by that of the structural variations
which combine to form it. At the minimum, the common structural portion by which the
alternative's structure is augmented is that of the core. Appendix G contains the code
dbcslnprepro which calculates and adds fpcard and fp.

When the fingerprint terms, fpcard and fp, are added to the file structure shown in
Table 7, the complete file format for each structural variation follows the form:
TABLE 8

FC(F)(F)C[5]:CH:C(:CH:C(:CH:@5)OCH3)NHR1<FCD=TRIPOS_0393;PRICE=101.4
;SUPPLIER=ALDRICH;MW=190.14;LOGP=2.33;CTOPS=[encoded topomeric steric field string]
;fpcard=141;fp=[hexadecimal fingerprint string]>

FC(F)(F)C[5]:CH:C(:CH:C(:CH:@5)C(F)(F)F)NHR1<FCD=TRIPOS_0394;PRICE=14.84
;SUPPLIER=ALDRICH;MW=228.12;LOGP=3.32;CTOPS=[encoded topomeric steric field string]
;fpcard=121;fp=[hexadecimal fingerprint string]>



When initially constructed the virtual library consisted of the files described above.
However, since the fingerprint metric is calculated for each set of structural variations
attached to a specific core, separate structural variation files containing the fingerprint data
were required for each combination of core with the structural variations. The virtual
library therefore contained a great deal of redundant data (structural variation files
repetitively containing the same non-fingerprint data). Accordingly, a more efficient virtual
library is constructed by locating the fingerprint data associated with each structural
variation file and different cores in separate files. Thus, only one copy of each structural
variation file (like Table 7) is required, and there is an associated fingerprint file containing
fpcard and fp for every core with which the structural variation file is used. The virtual
library keeps track of all the individual files in a master file. For instance, on one line of
the master file is kept the information that the Table 6 file is associated with its appropriate
structural variation files and fingerprint files. Each line of the master file relates one
Table 6-like (cSLN) file with the appropriate structural variation files and fingerprint files.
The same structural variation files may now be used with more than one cSLN as long as
the same type of chemical reaction is involved. Appendix G contains the code dbcslnprepro
(a/k/a "power") which calculates fpcard and fp, writes the fingerprint files, and updates the
master file.

Clearly, the data associated with each structural variation in each file can be directly
expanded to include the results of the application of any other validated metric to the
structural variation.

iv. Summary of Method & Scope of Chemistry 
Creation of a virtual library of structural variation files along with one definition file
is all that is needed to describe all the products of a combinatorial synthesis; that is, all
possible products of the combinatorial synthesis are now described using only descriptors of
the structural variations. As many additional combinatorial syntheses may be added to the
virtual library as is desired. Clearly, the larger the number, the more comprehensive will
be the universe of accessible compounds which can be searched. In this manner the
N1 x N2 x N3 x ... number of products may be analyzed using only the N1 + N2 + N3 + ...
number of structural variations. This ability to search a geometrically large number of product
structures by searching through only the arithmetic sum of their parts is the key feature of
the virtual library and is possible because of the identification and use of validated
descriptors possessing the neighborhood property. Clearly, this same method is equally
applicable to any large assembly of compounds not derived from a combinatorial synthetic
scheme which can be described as combinations of structural variations. Any number of
additional fields containing information about the structural variation may be added to the
file format, and may be meaningfully used as part of the search criteria for subset
selection.
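
The contrast between the number of products and the number of structural variations
actually stored can be made concrete with a few lines of Python. The function name is
hypothetical, and the site counts are the ones used as examples elsewhere in this
description.

```python
from math import prod


def library_size(site_counts):
    """Return (number of products, number of stored structural variations)."""
    return prod(site_counts), sum(site_counts)


print(library_size([1000, 1000]))     # two sites, 1,000 variations each
print(library_size([15000, 15000]))   # the diamino example
```

For two sites of 15,000 variations each, 30,000 stored records stand in for
225,000,000 products, and every further site multiplies the first figure while only
adding to the second.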




There is special merit in assuring that each product which a user may select from
this database (virtual library) corresponds to a known synthetic route and known available
reagents. However, the routines which the user applies to select subsets of the virtual
library, described below, do not depend on this. Neither does the representation itself
inherently depend on the assumption of a known synthesis pathway. Therefore it can be
applied to any situation in which the set of compounds of interest can be expressed
concisely as a core and points of enumerated structural variations. This makes the scope of
the method, in principle, cover virtually all of small molecule chemistry. In the limit, any
molecule is divisible into such a representation where there may be only one "structural
variation" known in each list. In fact, the practical advantages of the invention will only
obtain when the number of structural variations is large.
E. Searching the Virtual Library 

The techniques of constructing and searching the virtual library present the
molecular researcher with powerful methods of discovery not previously possible and
represent another major advance in the state of the art. Since the virtual library is
constructed for purposes of finding molecular similarities in structure and function, a
unique feature of the virtual library is that you can ask questions of similarity in two
fundamental ways - providing, essentially, two sides of the same coin. The first way is in
the design of screening libraries - subsets of the virtual library where what is sought are all
those product molecules meeting some set of similarity criteria and not their structurally
and/or functionally equivalent neighbors (as illustrated in Figure 1B). The second way is in
expanding on a lead compound (lead explosion) - subsets of the virtual library where what
is sought are all those product molecules meeting some set of similarity criteria to the lead
and all the structurally and/or functionally equivalent neighbors. Clearly, as a given line of
inquiry is followed, the search for the desired subsets may, at any given level of detail,
take on aspects of one or the other of these two methods of inquiry. For instance, a search
for all product molecules matching a lead compound may result in 10 million possibilities.
In order to make the synthesis and actual screening more efficient, out of these 10 million,
a screening library may be selected which does not sample the same neighborhood space
more than once. This ability to perform different types of similarity searches underlies the
discussion which follows.

Any of the characteristics associated in the virtual library file with each structural
variation may be searched separately or in conjunction with other characteristics. Since
validated metrics are used as descriptors for each structural variation, it is possible using
only the data contained in the structural variation files to quickly identify those product
molecules which could be formed from the structural variations similar in structure and
biological activity to known molecules (such as lead compounds) or arbitrarily chosen
molecules (screening libraries). With the virtual library, a structural search can be carried
out without having to actually generate and compare any explicit structures of any possible
product molecules. Subset libraries (screening libraries) representing molecules with
selected characteristics can thereby be directly created by a search of the virtual library,
and product structures created and generated only for those molecules included in the
subset library. It is important to understand that the virtual library can be formed from any
number of combinatorial synthetic schemes or can include molecules which, while not
based on a combinatorial synthetic scheme, may be expressed in the form of a cSLN.
Methods of including and searching such molecules will be discussed below. Not only does
the discovery of a way to create the virtual library make it possible to search an
extraordinarily large number of possible molecular structures, but it also makes it possible
to do the searching in an extremely efficient manner and in a very short period of time.

Since a variety of data associated with each structural variation, including that
resulting from the application of validated metrics, is stored in the virtual library, the range
of questions (searches) and the types of answers (subset libraries) one can ask of and
receive from the virtual library is virtually unlimited and the number of possible product
molecules examined to answer the questions is extraordinarily large. As emphasized earlier,
the virtual library associates precomputed metric values with each structural variation.
Library searching is based on the discovery that the metric characteristics of product
molecules can be usefully estimated by the metric values of the structural variations used to
form the products. As has been seen above, in the case of the Tanimoto fingerprint, it was
also necessary to take into consideration in preparing the precomputed metric values some
estimation of the core structure. For topomeric field searching, a useful method of
comparison involves taking the root mean sum of squares differences between the metric
field values of one structural variation and another. This value can then be compared to a
chosen neighborhood distance to determine similarity. Finally, it should be recognized that
in discussing core structures used in combinatorial arrangements, for purposes of creating
and searching the virtual library, it is possible to consider a single bond as a core structure.
In such a case, the structural variations would be combinatorially combined across a single
bond.
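
The topomeric field comparison mentioned above can be sketched in Python. The sketch
assumes that each field is a sequence of steric values sampled at the same lattice
points, and it reads the comparison as the square root of the summed squared
differences; the optional mean flag gives the root-mean-square reading instead. The
function names and the toy field values are illustrative assumptions.

```python
import math


def field_distance(field_a, field_b, mean=False):
    """Root of the summed (or, if mean=True, averaged) squared differences."""
    if len(field_a) != len(field_b):
        raise ValueError("fields must sample the same lattice points")
    total = sum((a - b) ** 2 for a, b in zip(field_a, field_b))
    if mean and field_a:
        total /= len(field_a)
    return math.sqrt(total)


def is_neighbor(field_a, field_b, neighborhood_distance):
    """True when two variations fall within the chosen neighborhood distance."""
    return field_distance(field_a, field_b) <= neighborhood_distance


print(field_distance([0.0, 3.0], [4.0, 0.0]))    # sqrt(16 + 9)
print(is_neighbor([0.0, 3.0], [4.0, 0.0], 5.0))
```

Whichever reading is adopted, the same form must be used when the neighborhood
distance is established and when it is applied in a search.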

As presently implemented by the inventors, the virtual library has to date 170 billion
possible product compounds representing 70,000 combinatorial reaction schemes over
various cores, and it is being expanded monthly. The sheer size of the virtual library
suggests that search times must be similarly enormous. However, using the search
methodology described below, made possible by the construction of the virtual library
based on validated metrics, real world searching rates of greater than 200 - 500 million
compounds per hour have been routinely achieved with a single processor. Higher rates are
achievable on a parallel processing computer with multiple processors such as are now
available from several vendors including Silicon Graphics, Inc.

i. Example Search Routine of Virtual Library - Tanimoto Similarity
A brief overview of a typical search utilizing 2D fingerprints (a validated metric)
will highlight the general approach used for all searches of the virtual library, which at
their most fundamental level, rely on the values of the neighborhood distances found for
the validated metrics. The overall process of using the Tanimoto fingerprint metric to
search for molecules is summarized in the flowchart of Figures 21, 22, and 23. A typical
library based on the combinatorial synthetic scheme utilizing a reactive diamino core will
be used again as an example. As noted, this synthetic scheme alone contributes
approximately 112 billion compounds to the virtual library data base. The question typically
presented will ask whether the virtual library contains any molecules having a structure
likely to yield a biological activity close to that of some known compound. To complete the
search nothing need be known about the actual chemical compound for which close
structures are desired, provided a 2D fingerprint for the molecule is supplied. Of course,
generally, the molecular structure of the known molecule is provided and the software
calculates the 2D fingerprint. A particularly important consideration is that the known
molecule need not have resulted from a combinatorial synthesis and can, in fact, have any
possible structure. The searching method of this invention independently searches each set
of associated files generated by the virtual library construction method of the invention; in
the case of the diamino example, a set of three files as outlined earlier. The reason each
must be searched independently is that the searching program utilizes a knowledge of the
number of sites (at which structural variations occurred in the synthetic scheme) to analyze
the closeness of structure to the test molecule.

Based on knowledge of the neighborhood property of the validated Tanimoto metric,
any molecule falling within a neighborhood Tanimoto similarity of 0.85 of another
molecule should possess similar structural and biological characteristics. For this example,
a Tanimoto similarity of 0.85 provides the basic selection criterion for examining the virtual
library data base. Continuing with the example above, the fingerprint of the known
molecule would first be compared to the fingerprint contained in every structural variation
occurring at each of the two sites (2 x 15,000). The method determines how many of the
bits set by the known molecule would be set by each structural variation. For all 15,000
choices at varying site R1 (all 15,000 structural variations at R1) the method compares the
known molecule's fingerprint to the structural variation fingerprint. The same is then done
for all 15,000 structural variations at site R2. Then, for each one of the 15,000 choices at
varying site R1 the number of the matching bits set by that structural variation is added to
the number of the matching bits for each one of the structural variations at R2. For the
entire set of structural variations at R1 and R2, this involves only the integer addition of
15,000 x 15,000 terms and may be typically accomplished within fractions of a minute.

As each addition is completed, the resulting sum is compared to the Tanimoto
neighborhood criterion. Suppose 100 bits were set by the known molecule. If the sum of bits
totaled 65 and the neighborhood Tanimoto criterion of 0.85 (85 out of 100) were used, it
would not be possible for any combination of those structural variations to form a molecule
which would closely match the structure of the known molecule.

As noted above, the method also provides a check (MBITS) on the approximation
routine used to calculate the fingerprints of the product molecules which would be formed
from the two structural variations at sites R1 and R2. In this example, a typical MBITS
value of 4 is assumed. Adding the 4 MBITS to the 65 only yields 69 which is clearly not
within the required degree of Tanimoto neighborhood. However, had the bits from the
structural variations added to 82, then the addition of the MBITS 4 would yield a total of
86, and the molecule formed from those structural variations would be considered close
enough to check further. To confirm a match, the fingerprints from the two structural
variations involved are OR-ed (Boolean) so that commonly set bits are counted only once
and then compared to the fingerprint of the known molecule. Only if the resulting number
when added to the MBITS term is greater than 85, is the product molecule represented by
the two variations considered a match and included in a subset library resulting from the
search. While these additional calculations take extra time, it is only necessary to perform
them on structural variation combinations which pass the first level of screening (set bits >
85). Therefore, typically only thousands of extra additions need to be calculated instead of
millions, and the method is fast. By the method of this invention hundreds of millions
of possible compounds may be searched within a couple of hours of computer time.
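
The two levels of screening can be sketched in Python under stated assumptions:
fingerprints are held as integers, the threshold is expressed as a count of query bits
(85 in the example, where the known molecule sets 100 bits), and the function and
parameter names are hypothetical rather than those of the actual search code.

```python
def popcount(bits):
    return bin(bits).count("1")


def search_two_sites(query_fp, site1, site2, min_common, mbits):
    """Return (i, j) index pairs of variations that pass both screening levels."""
    # One pass over each site: N1 + N2 fingerprint comparisons.
    match1 = [popcount(fp & query_fp) for fp in site1]
    match2 = [popcount(fp & query_fp) for fp in site2]
    hits = []
    for i, m1 in enumerate(match1):
        for j, m2 in enumerate(match2):
            # First level: integer addition only. Bits matched by both
            # variations are counted twice, so the sum can only overestimate.
            if m1 + m2 + mbits <= min_common:
                continue
            # Second level: OR the fingerprints so shared bits count once.
            common = popcount((site1[i] | site2[j]) & query_fp)
            if common + mbits > min_common:
                hits.append((i, j))
    return hits


query = 0b11111111
site1 = [0b00001111, 0b00000001]
site2 = [0b11110000, 0b00001111]
print(search_two_sites(query, site1, site2, min_common=6, mbits=1))
```

Because the first-level sum is never smaller than the OR-ed count, a pair rejected at
the first level could not have passed the second; the cheap test therefore loses no
matches.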

This testing procedure is continued through every set of structural variation virtual
library files. Different sets of files resulting from other two site synthetic schemes would
be checked in a similar fashion. When the known molecule was tested against a file set
constructed from a synthetic scheme having three sites at which a structural variation could
occur, the sum of the matching fingerprints contributed from three structural variations
would be used and tested against the fingerprint of the known molecule in an identical
manner. The actual method embodied in the software performs many quick checks on each
set of structural variation files and quickly ascertains whether that set of files could yield a
product structure with the required structural characteristics (fingerprint in this example). If
the quick check indicates that the set of files could not yield the known molecule, the
search is quickly advanced to the next set of files. In fact, on a parallel processing
machine, many simultaneous searches are performed. Thus, the time to search the entire
virtual library is relatively short.

Several points are extremely important. First, the characteristic of the known
molecule is checked against only files associated with the structural variations. Thus, a set
of associated files containing 2,000 structural variations (where 1,000 structural variations
may occur at each of two sites) requires the examination of only 2,000 structural variations
to accomplish a search of 1,000,000 (1,000 x 1,000) possible product molecules. Second,
during the search only the structural variations which would contribute to a molecule
having the desired structural characteristics are identified. Only after all such structural
variations are identified are the actual product molecules assembled from the structural
variations and their entire structure specified for inclusion in the desired subset. Third, it
does not matter whether the known molecule could be synthesized by a known
combinatorial scheme. The information derived from a search such as in the example
would identify those molecules which could be derived from a combinatorial scheme which
most likely have the same structural and biological characteristics as the known molecule.
However, in creating the virtual library, all that is required is that the compounds can be
described by a CSLN. The searching method of this invention could equally well find one
or more of these molecules not derived from a combinatorial synthetic scheme as being
likely to have the same structural and biological characteristics as the known molecule. The
only difference in this latter case is that no information about a possible synthetic route is
available from the results of the search.

Clearly, the greater the number of compounds specified in the entire virtual library
data base, whether based on known combinatorial synthetic schemes or resulting from other
synthetic pathways and expressed as a CSLN, the greater the likelihood of finding
molecules with similar structural and biological characteristics. Fourth, such structural
searches require the use of validated metrics exhibiting a neighborhood property to
characterize both the structural variations and the known molecule. Fifth, once the virtual
library data base is constructed based on the method of this invention, there are any
number of different types of searches which can be run. The software code provided with
this application permits many such searches as outlined in the descriptions of the code
below.
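
The first point, that examining 2,000 structural variations suffices to search 1,000,000
products, can be illustrated with a short sketch. It is a hypothetical illustration, not the
software of the Appendices: fingerprints are assumed to be Python integers, and the pruning
rule (skip a variation that cannot reach the cutoff even with the best partner) is one
simple way to exploit the per-variation scores.

```python
def popcount(bits: int) -> int:
    """Number of set bits in a non-negative integer used as a fingerprint bitset."""
    return bin(bits).count("1")


def candidate_pairs(site1, site2, query, mbits, required):
    """Score every structural variation once, then pair only promising ones.

    site1, site2: lists of fingerprints for the variations at each site.
    Returns (i, j) index pairs whose summed shared bits plus mbits exceed `required`.
    """
    scores1 = [popcount(fp & query) for fp in site1]  # one pass over site 1
    scores2 = [popcount(fp & query) for fp in site2]  # one pass over site 2
    best2 = max(scores2, default=0)

    hits = []
    for i, a in enumerate(scores1):
        if a + best2 + mbits <= required:
            continue  # no partner at site 2 can lift this variation over the cutoff
        for j, b in enumerate(scores2):
            if a + b + mbits > required:
                hits.append((i, j))
    return hits
```

The fingerprints are compared with the query once per variation; product molecules are
assembled only for the index pairs returned.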

ii. Design Screening Libraries (Subsets of the Virtual Library)

In the current invention, one single method is used to select among all possible
products from one or more reactions which share a common core substructure. A bitset is
used to represent all the possible products (generally in the tens of millions). One may
choose to limit the design subset selection to those compounds which are made of reagents
from a specified subset of suppliers, to those of suitable price, to those of suitable
molecular weight, logP, etc. One may seed the design with a set of preselected products.
One may remove all products in the neighborhood of a subset of compounds as a preface to
the design run.

The design process, once all the above initial subset operations have been
performed, is extremely simple:

• select a compound to add to the design, and remove its neighbors from
further consideration

• continue until no other compounds are left

The selection may be random, or may be directed to maximize use of a reagent once
selected (this matches the practical requirements for a laboratory two-step synthesis in
which maximum use of the first step's intermediate structure offers a substantial advantage
in speed and cost). In principle, any rule can be invoked to prioritize which compound to
select next, since any remaining compound is allowable at every step. Examples of this
type of search are given below.
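
The select-and-remove loop can be sketched in a few lines. This is a minimal sketch under
stated assumptions, not the code of the Appendices: `is_neighbor` stands for whichever
validated metric defines the neighborhood, selection is random, and all names are
hypothetical.

```python
import random


def design_subset(candidates, is_neighbor, rng=None, max_picks=None):
    """Pick compounds one at a time, discarding each pick's neighbors.

    candidates:  iterable of compound identifiers still eligible for the design.
    is_neighbor: function (a, b) -> bool, True when b lies in a's neighborhood.
    max_picks:   optional cap on the number of selections.
    """
    rng = rng or random.Random(0)  # fixed seed only so the sketch is repeatable
    remaining = list(candidates)
    design = []
    while remaining and (max_picks is None or len(design) < max_picks):
        pick = rng.choice(remaining)
        design.append(pick)
        # Remove the pick itself and everything in its neighborhood.
        remaining = [c for c in remaining if c != pick and not is_neighbor(pick, c)]
    return design
```

Replacing `rng.choice` with a rule that prefers products sharing a reagent with the
previous pick would give the directed variant mentioned above.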
(a) Subset Screening Library Based On Topomeric Fields and Tanimoto

A selection of a screening library based on the same criteria as were discussed in
the first part of this application is easily implemented using the virtual library. The library
members are identified based on topomers (is the distance too small in topomer space) and
on Tanimoto similarity separately, as was done in the earlier disclosed method. However,
every reagent is always allowed, unlike the earlier method in which only a small subset of
reagents made it through the reagent filter to the product stage. The earlier methods
selected products based on maximal dissimilarity of product Tanimoto at each selection.
Since by using the virtual library only the final selection set (all possible combinatorially
created molecules meeting the selection criteria) is used, and does not depend upon or rely
upon the ordering within a selected set (of reagents), the virtual library method is more
flexible and in practice faster than the earlier disclosed method. In fact, since the product
selection is not constrained by reagent stage selection, somewhat larger screening libraries
result from using the virtual library. The overall process of using both the topomeric
CoMFA and Tanimoto metrics to search for molecules in the virtual library is summarized
in the flowchart of Figures 24, 25, and 26. Code to implement this search, db_des, is
contained in Appendix K. A more extensive description of the code may be found in
section G which follows.

(b) Subset Based on Tanimoto Similarity

A subset of the virtual library chosen just based on Tanimoto similarity/dissimilarity
of product molecules, which could be created meeting some initial selection criteria, can be
directly chosen. Code to implement this search, dbcslqs, is contained in Appendix I. A
more extensive description of the code may be found in section G which follows.

(c) Subset Based on Topomeric Fields

A subset of the virtual library chosen just based on topomeric CoMFA field
similarity/dissimilarity of product molecules, which could be created meeting some initial
selection criteria, can be directly chosen. Code to implement this search, db_qstop, is
contained in Appendix J. A more extensive description of the code may be found in section
G which follows.

(d) Subset Based on Combined Metric

A subset of the virtual library may be based as well upon the combined topomeric-
fingerprint metric described earlier. Code to implement this search, db_both, is contained
in Appendix L. A more extensive description of the code may be found in section G which
follows.

iii. Designing Lead Optimizations

The various techniques of lead optimization used to explore the island of activity
were discussed earlier. The same techniques used with the virtual library are much more
powerful since a vastly larger chemical universe is being investigated. Generally, any
property associated with a structural variation in the virtual library can be used to expand
and define the product molecules sought.

Subsets of molecules from the virtual library database may be selected based on
descriptors typically including, but not limited to, the following:

• reagent identifier

• reagent supplier

• reagent or product molecular weight

• reagent or product price

• reagent or product estimated logP

• reagent shape contribution; product shape contribution under certain restrictions

• reagent or product 2D fingerprint

• product substructural features

Subsets may be selected by methods including, but not limited to, applying simple
filters, requiring a specific degree of similarity to reference compounds, or applying
proprietary design tools.
Specifically, the initial modes of subset selection may include:

• substructural searching, to identify compounds which have a set of required
structural features, is perhaps the most often used method of chemical database
subset selection

• 3D feature searching, to add interatomic distance requirements to the
substructural searching, is also familiar to experts in chemical database
searching

• similarity searching, to find subsets which are substantially like a reference
compound, is widely used as well and corresponds to application of a
neighborhood principle applied to 2D fingerprints or, as planned extensions, atom
pair distance fingerprints, etc.

• scalar searches corresponding to traditional nonstructural database queries, to
find compounds with for example logP between 5 and 8 and molecular weight
under 500 and price above 750.

• maximum dissimilarity queries, which are used primarily to order a large subset
of compounds such that as one reads down the ordered list, compounds are less
distinct from each other as a group

• STIGMATA (a procedure popularized by the scientists at Parke Davis) queries,
in which compounds are selected based on the presence of specific bits in a
fingerprint (2D, atom pair, pharmacophore triplets, etc.). Commonly such a
query is derived by reference to a set of desirable compounds, from which the
bits present in all compounds in the set are derived.

• design queries (scalar, topomer, fingerprint, arbitrary weightings of any of
these) of either of two types:

• gridding methods, in which the objective is to have one compound
within each specific "hypercube" of the design space

• neighborhood methods, in which the objective is to obtain a set in
which no two compounds are overly similar, and in which no "holes"
exist needlessly
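
As an illustration of one of these modes, the STIGMATA-style query (bits present in all
compounds of a reference set) can be sketched as below. This is a hypothetical sketch, not
the Parke Davis procedure itself or the code of the Appendices; fingerprints are assumed to
be Python integers and the function names are invented for the example.

```python
from functools import reduce


def stigmata_mask(reference_fps):
    """Bits set in every fingerprint of a non-empty reference set (bitwise AND)."""
    return reduce(lambda a, b: a & b, reference_fps)


def stigmata_select(library, mask):
    """Names of library compounds whose fingerprint contains every bit in the mask.

    library: dict mapping compound name -> fingerprint (int bitset).
    """
    return [name for name, fp in library.items() if fp & mask == mask]
```

A scalar query such as the logP, molecular weight, and price ranges mentioned above would
be an ordinary filter applied before or after this selection.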

(a) Search Based on Tanimoto Similarity

Details of a typical lead optimization using the Tanimoto metric were highlighted under
section 12(E)0) above. Essentially, what is sought is a list of all compounds to be found within
the Tanimoto neighborhood of the lead. Code to implement this type of search, db_sim, is
contained in Appendix H. A more extensive description of the code may be found in section
G which follows.

(b) Searches Based on Topomer Similarity

The notion of topomer similarity of a pair of molecules is well defined if the molecules
have some common "core". An enhanced method has also been discovered which allows
arbitrary structures as search queries, not just those which result from a combinatorial
synthesis. Therefore, to find molecules similar to some target within the virtual library, the
following three phase operation as summarized in the flowchart of Figures 27, 28, 29, and 30
must be performed:

1) Determine which of the "common core" substructures (where the core may
consist of a single bond and any single bond is equivalent to any other single
bond for topomer searching) within the virtual library are wholly contained
within the search target molecule. This can be done by any standard searching
program, such as Tripos' Unity package.

2) For each of the common cores found, remove that common core from the
search target. The atoms remaining will comprise one or more side chains.
Generate the topomeric conformations of each of the side chains, using the same
code that is used to build topomeric conformations during library ("all possible
products") generation. Generate the topomeric conformation of the core.

3) Using these topomeric conformations of each of the target molecule's side
chains, search the combinatorial libraries corresponding to the previously
identified common cores for all side chains whose sum of corresponding side
chain topomeric differences is less than the neighborhood radius within the
typical neighborhood range of 80 - 100 kcal/mol (91 kcal/mol). Alternatively,
the root sum of square differences between the fields may be used to determine
the selection criteria. The procedure is shown in the flowsheet of Figures 27,
28, 29, and 30 and described below.
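
The selection criterion of step 3 can be sketched as follows. This is only an illustrative
sketch: it assumes each side chain's topomeric field is available as a list of numbers on a
common grid, and that the difference between two fields is the root sum of squared
point-by-point differences. Both the data layout and the function names are assumptions,
not the code of the Appendices.

```python
import math


def field_difference(field_a, field_b):
    """Root sum of squared differences between two equal-length field vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(field_a, field_b)))


def within_neighborhood(query_chains, product_chains, radius=91.0, combine="sum"):
    """Compare corresponding side chains and test the total against the radius.

    combine="sum": add the per-side-chain differences (the first criterion).
    combine="rss": take the root sum of squares of them (the stated alternative).
    """
    diffs = [field_difference(q, p) for q, p in zip(query_chains, product_chains)]
    if combine == "sum":
        total = sum(diffs)
    else:
        total = math.sqrt(sum(d * d for d in diffs))
    return total < radius
```

The default radius of 91 is the value quoted in the text; any value in the 80 - 100 range
could be passed instead.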

(c) Topomeric (3D) Searching of Arbitrary Molecular Structures
In addition to searching the virtual library as outlined above, it is possible to conduct
searches which were heretofore impossible by any means. In particular, a critical question
which frequently occurs in chemical research, and especially in biological research, can now
be addressed by the discovery and creation embodied in the virtual library. The problem, as
it is usually presented, takes the form: given an arbitrary query molecule (generally one
previously found to exhibit a desired activity), find biologically similar molecules, that is
molecules of similar 3D shape and activity, that can readily be made and tested. Generally,
such a query molecule will not have resulted from a combinatorial synthesis, and, in fact, no
knowledge of a possible synthetic route to the molecule may be available. As an example,
suppose that compounds similar in 3D shape to but structurally different from the structure
(written in SLN) CH3C(=O)NHCH(CH3)CH2NHCH2CH2OH are desirable, perhaps because
this hypothetical structure was reported to be highly active in a competitive pharmaceutical
preparation.

As described earlier, the topomeric 3D shape data within the Virtual Libraries actually
describe fragments (structural variations) of molecules. To find similarly shaped molecules
within the virtual library, the query molecule must be fragmented and the shapes of its
fragments compared with the shapes of corresponding fragments (structural variations) in the
virtual library. The difficulty is that a query molecule can be fragmented in so very many ways
for searching against the virtual library containing in excess of 10^12 molecules. (The example
given has nine bonds connecting heavy atoms, so there are nine two-fragment combinations
that could be considered, 9 x 8 = 72 three-fragment combinations, 9 x 8 x 7 = 504
four-fragment combinations, etc.) Given this situation, what is needed is a way to emphasize
those fragmentations that are most likely to conform to efficient synthetic routes from available
starting materials, without requiring the searcher of the virtual library to have any knowledge
of what synthetic routes it includes.

The solution to this problem which can be uniquely achieved with the virtual library
is a "fragmentation table", where each row constitutes a rule of the following sort: "for each
occurrence of this particular structural feature combination (structural variation) in the query
molecule, decompose the query molecule in a particular way specified in terms of this
structural feature, and search only those combinatorial libraries that utilize specified reactions
(sequences) and/or building blocks, mapping specified query fragments onto specified classes
of building blocks". Each such query decomposition found generates a search of the virtual
library, returning all those products whose sum of squares of differences in shape between
corresponding product and query fragments is less than a user specified neighborhood distance
threshold. Passing the query molecule (by means of a suitable computer program) against all
the rows of this table generates all searches.

To illustrate this approach with a simple example, one row in the table might have as
its structural feature C(=O)-[!r]NH (amide bond, where [!r] states that the preceding bond
must not be cyclic). This row would specify cleavage between the N and C of any matching
fragment within the query, for our example query yielding the fragments CH3C(=O)- and
-NHCH(CH3)CH2NHCH2CH2OH, and the characteristics that a matching subset library
should have (primary or secondary amine reacting with an acid, acid chloride, isocyanate,
chloroformate). The similarity searching engine then returns all products in the virtual library
formed from amines close enough in shape to -NHCH(CH3)CH2NHCH2CH2OH and acylating
reagents close enough in shape to CH3C(=O)-.

Note that the amide bond is a synthetic convenience, not an absolute arbiter of shape
similarity. Molecules in which the amide bond is "reversed" might also be sufficiently shape
similar overall to have biological similarity to the query molecule, despite the local differences
in shape resulting from the NH to C=O mismatch. Indeed, any reaction that forms a single
acyclic bond might contain bioisosteres of our query molecule within its virtual library. On the
other hand, an amide library would contain both the most accessible and also the largest
number of bioisosteres and so this is the library that should first be searched.

Another row in the fragmentation table might designate a query decomposition into
three fragments, with a structural feature R-[!r]NXN-[!r]R. Application of this row to our
query molecule would generate CH3C(=O)-, -NHCH(CH3)CH2NH-, and -CH2CH2OH.
When searching the "diamine" library (about 10^11 structures) for similarity using these
fragments, the "core" or diamine component is searched first for fragments similar in shape
to -NHCH(CH3)CH2NH- (see below for a description of the special features of core shape
similarity). Core shape similarity is much rarer than side-chain shape similarity and so an
efficient search process considers core similarity before considering side chain similarity.

An example of what a few rows in a typical fragmentation table look like is shown
below. The description of the individual named columns are as follows:

CLASS_ID = equivalent in meaning and value to CLASS_ID in the REACTIONS
table. Identifies a particular reaction sequence as it would be carried out in the laboratory.
Only those virtual library records whose CLASS_ID matches this value will actually be
searched.

PRIORITY = Allows a searcher to control the depth of a search. Lower values
correspond to reactions which are less general, but whose products are more likely to resemble
a matching query. Deeper searches will also consider rows having higher values of
PRIORITY.

SLN = the structural pattern that will be matched within the query molecule. Each
match found within the query molecule generates a decomposition of the query into fragments
for topomeric similarity searching, as detailed elsewhere.

REACTANTS = Allows the developer of this table to limit application of a particular
row to reactions involving particular classes of reactants.

ATOMS = Specifies, by reference to the fragment description within the SLN column,
the bonds in the query whose breaking will generate the fragments to be used in topomeric
field similarity searching.

The three rows shown illustrate the three examples discussed elsewhere in this
description: Row 1 - diamine derivatization; Row 3 - amide formation; Row 7 - thioether
cleavage. For clarity the information for these rows is broken into three sections:

        1           2           3
        CLASS_ID    PRIORITY    SLN

ROW1    5           2.00        Hev-[!R]NXN-[!R]Hev
ROW3    6           2.00        HevHev(=O)-[!R]NHev
ROW7    22          2.00        CS-[!R]HevHev

        4
        REACTANTS

ROW1    X1=RN=C=O,ClC(=O)OR,Epoxide,Ald/Ket,RC(=O)Cl,
        RCOOH,RCOO[-],RSO2Cl,ArF(activated),N:CHal,C=CCX,H
        X2=RN=C=O,ClC(=O)OR,Epoxide,Ald/Ket,RC(=O)Cl,
        RCOOH,RCOO[-],RSO2Cl,ArF(activated),N:CHal,C=CCX,H
ROW3    X1=Amine(-' 3) X2=RCOOH,RN=C=O,ClC(=O)OR,RC(=O)Cl,
        RCOO[-],RSO2Cl,ArF(activated),N:CHal,C=CCX
ROW7    X1=RSH X2=RN=C=S,RN=C=O,RSO2Cl,RCl,ArF(activated),
        N:CHal,RBr

        5
        ATOMS

ROW1    1,2 5,4
ROW3    4,2 2,4
ROW7    2,3 3,2
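
One way to hold such rows in a program, and to apply the PRIORITY depth control, is
sketched below. This is a hypothetical data layout, not the table format used by the
software of the Appendices. The SLN strings and the reading of the ATOMS entries as pairs
of atom indices naming the bonds to break are reconstructions from the printed table and
should be treated as assumptions; the REACTANTS column is omitted for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FragmentationRow:
    row: int          # row number in the fragmentation table
    class_id: int     # CLASS_ID of the reaction sequence to search
    priority: float   # PRIORITY: lower values are searched first
    sln: str          # structural pattern matched within the query
    atoms: tuple      # assumed: pairs of pattern atom indices naming bonds to break


TABLE = [
    FragmentationRow(1, 5, 2.00, "Hev-[!R]NXN-[!R]Hev", ((1, 2), (5, 4))),
    FragmentationRow(3, 6, 2.00, "HevHev(=O)-[!R]NHev", ((4, 2), (2, 4))),
    FragmentationRow(7, 22, 2.00, "CS-[!R]HevHev", ((2, 3), (3, 2))),
]


def rows_to_search(table, max_priority):
    """Rows whose PRIORITY is within the requested depth, lowest PRIORITY first."""
    eligible = (r for r in table if r.priority <= max_priority)
    return sorted(eligible, key=lambda r: r.priority)


def class_ids_to_search(table, max_priority):
    """CLASS_ID values of the virtual library records that would be searched."""
    return [r.class_id for r in rows_to_search(table, max_priority)]
```

Raising `max_priority` corresponds to a deeper search that admits more general reactions.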



The power and utility of topomeric steric field analysis of fragmented structures is
highlighted by a recent analysis of the structures of Tagamet and Zantac (H2 antagonists).
Tagamet and Zantac were each fragmented according to Row 7 of the fragmentation table and
the topomeric steric fields calculated. The metric distance (difference in metric values) for the
two compounds was 127.

Remembering that a range of 80 - 100 defines a neighborhood distance for an
approximate log2 biological difference for the topomeric CoMFA descriptor, the value of 127
strongly suggests that Tagamet and Zantac should have similar biological activities. Such
knowledge would have been very useful to those either seeking to protect molecules with
similar structure/activity to the known molecule or to those seeking to find molecules which
look similar to the receptor but which are not entirely structurally identical to the known
molecule. It should be noted that other widely used diversity approaches, 2D fingerprints and
pharmacophoric patterns, show a remarkable lack of similarity between the drugs. Indeed, in
the topomeric configuration generated by the methods of this invention, Tagamet and Zantac
look very similar even to the unaided eye as shown in Figure 31.

(d) Topomeric (2D) Searching of Core Structures
An ancillary problem when attempting to find molecules in the virtual library
(constructed principally from combinatorial chemistries) which are structurally and biologically
similar to a given query molecule is the treatment of the central core to which structural
variations can be attached. The virtual library defines the shape similarity of two molecules
as the sum of the similarities of comparable fragments. "Core" fragments are any fragments
that have multiple attachment bonds to other fragments, in contrast to "side chain" fragments
which have only one attachment bond.

Overall molecular shape will be affected most by the relative positions of core
attachment bonds. Consider the three possible bivalent phenyl cores, ortho, meta, and para.
These will be quite similar in their intrinsic shapes - only a hydrogen changes place - but the
molecules derived from the three cores will be very different in shape if the side chains are
at all bulky. Therefore in considering the shape similarities of cores the relative positions of
attachment bonds must be weighted far more heavily than the shape differences themselves.

The prior art has attempted to deal with this problem. Lauri and Bartlett have
described CAVEAT, which in the nomenclature of this disclosure would be considered a "core
similarity" searching system that considers only relative attachment bonds, not shape, of all
theoretically constructible cyclic cores. In their work, the relative geometry of two attachment
bonds is expressed in terms of their distance, angle, and torsions. In contrast the present
inventors have found that a much more self-consistent shape classification of, for example, all
750 commercially offered diamines, is obtained when one of the attachment bonds is aligned
on the X-axis (as in the standard topomer conformation, described earlier) and the differences
calculated as the root mean square of summed differences in the x, y, and z coordinates of the
two ends of the other attachment bond. (The conformation used in this procedure is the
topomeric conformation of the core with a methyl group replacing the more distant attachment
bond.) This procedure differentiates cyclic from acyclic fragments much more strongly than
it differs among the linear acyclic moieties pentylyl, hexylyl, and heptylyl.

In addition to this RMS difference in x, y, and z, the differences in steric (and any
other fields) also contribute to the bioisosteric differences between two cores. Because there
are potentially two or more possible attachment bonds in a core, there are two or more ways
in which two or more cores may be compared. So the difference in fields is taken as the least
of these possible differences. The combination of two descriptors in considering the difference
between two core structures, the attachment bond differences and the field differences,
introduces a relative weighting concern. In practice it has been found in clustering experiments
like those described for the thiols that the internally most self-consistent classification of 750
diamines results from numerically equal weighting of the two RMS differences.
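
A rough sketch of the core comparison follows. It is illustrative only and rests on several
assumptions: both cores are taken to be already oriented with the first attachment bond on
the X-axis, the second attachment bond is given as the coordinates of its two ends, the
field differences for each possible alignment have been computed elsewhere, and "equal
weighting" is read here as a plain sum of the two terms. None of the names come from the
Appendices.

```python
import math


def attachment_rms(bond_a, bond_b):
    """RMS difference over the x, y, z coordinates of both ends of an attachment bond.

    bond_a, bond_b: ((x, y, z), (x, y, z)) for the two ends of the second bond.
    """
    squares = [
        (p - q) ** 2
        for end_a, end_b in zip(bond_a, bond_b)
        for p, q in zip(end_a, end_b)
    ]
    return math.sqrt(sum(squares) / len(squares))


def core_difference(bond_a, bond_b, field_differences):
    """Equal-weight combination of attachment-bond and field differences.

    field_differences: one field difference per possible way of aligning the
    two cores; the least of them is used, as described in the text.
    """
    return attachment_rms(bond_a, bond_b) + min(field_differences)
```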

Thus, the successful generation of a topomeric descriptor for cores involves two
advances. In comparison with the procedure for side chains, the relative position of attachment
points has been introduced, for example, to distinguish ortho phenylene from para phenylene.
In comparison with the treatment of attachment points previously described by Bartlett et al.,
the use of differences in x,y,z coordinates, rather than relative geometric measures such as
distances and solid angles, provides the stronger differentiation needed between, for example,
cyclic and acyclic cores.

G. Code Attachments

The following software code comprising the main sections of the invention is described
below and is attached in the Appendices. In addition, necessary auxiliary code is also set forth
in the Appendices. All together, all code necessary to fully disclose an enabling embodiment
of the invention in the computational chemistry environment specified earlier is set forth in the
several appendices. In some cases new code is provided which differs from that in the priority
documents to include enhancements described in the text. In particular, as the virtual library
has been expanded, it has been found that the larger number of compounds identified from the
searches is more conveniently handled by code which can deal with bitsets rather than ASCII
text. The additional auxiliary code required to manipulate the bitsets is contained in Appendix R.
However, the use of bitsets is a computational convenience and does not involve any change in
the construction or searching of the virtual library.
Appendix A:

One section of the code in this Appendix generates topomeric conformations, and
another section generates the best slope line for Patterson plots.

Appendix B: This code calculates the hydrogen bond variation to be applied to the
CoMFA steric field.

Appendix E: gct^cd.coff^ This code handles the first phase of the construction of the
virtual library.

Appendix F: SYB MGEN GPIi? COMFA HFY***rTnP«: This code calculates the
topomeric CoMFA field of each structural variation and adds it to the structural variation files.
It also allows the computation and use of other than just steric fields. This Sybyl expression
generator, written in C, is invoked from SPL by a call %comfa_hex(Row Column). It returns
an ASCII hexadecimal representation (0-9,a-f) for each CoMFA grid point in row "Row" and
CoMFA column "Column" in the string which is seen as CTOPS in the input files.

The encoding is as shown in the subroutine lookup_my_comfa_code(). As indicated,
a missing value is assigned "0" and all legitimate values are assigned a number according to
their numerical value. The binning is not quite linear; since the CoMFA values are
infrequently between 10 and 30, this was empirically found to reproduce the exact CoMFA
distances very well. The distances arising from this CTOPS description were validated against
data sets to confirm that the encoding and decoding introduced no significant roundoff
problems. The distance corresponding to the coded topomer field values of CTOPS is seen
in the dbcsln_des routine called WhatsTheDifference().
Appendix G: dbcslnprepro

This program takes the description of the common core and solicits for each substituted
portion the SLN for the extended core. From this, and the list of structural variations, it
computes the fingerprint and the fingerprint's cardinality for each structural variation and
appends this as the fpcard and fp fields.

Additionally, the program creates a specified fraction of product compounds and
computes their fingerprints exactly. The actual product fingerprint is compared to the
fingerprint estimated from the pieces, and any discrepancy is noted by counting how many
tested products have 0 missing bits, how many have 1, etc. The largest observed value is used
as the MBITS parameter for the reaction. The new version of this code performs the same
functions as the original code except that it writes separate files for fpcard and fp. In addition,
it forms a master file to keep track of the association of all the files.
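
The MBITS calibration described above can be sketched as follows. This is a hypothetical
sketch, not dbcslnprepro itself: fingerprints are assumed to be Python integers, and the
fingerprint "estimated from the pieces" is assumed to be the Boolean OR of the structural
variation fingerprints.

```python
from collections import Counter


def popcount(bits: int) -> int:
    """Number of set bits in a non-negative integer used as a fingerprint bitset."""
    return bin(bits).count("1")


def estimate_mbits(samples):
    """Derive MBITS from a sample of fully enumerated products.

    samples: iterable of (exact_product_fp, [piece_fp, ...]) tuples.
    Returns (mbits, histogram), where histogram[n] is the number of tested
    products whose exact fingerprint has n bits missing from the estimate.
    """
    histogram = Counter()
    for exact, pieces in samples:
        estimated = 0
        for fp in pieces:
            estimated |= fp  # assumed estimate: OR of the pieces
        missing = popcount(exact & ~estimated)  # bits of the true product not in the estimate
        histogram[missing] += 1
    return max(histogram, default=0), histogram
```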
Appendix H: dbcsln_sim

This program takes one or more SLN structures as queries, along with the MBITS and
the desired Tanimoto similarity, and the output of the dbcslnprepro run. It produces a listing
of all products which may be above the Tanimoto cutoff value, by listing the index of each
structural variation and both the apparent Tanimoto and the maximum possible Tanimoto (it
is the maximum possible Tanimoto which defines the results). This code now reads master files
and can read bitsets output from other files.

Appendix I: dbcslnQS

This program takes the results of the dbcslnprepro program, along with the MBITS and
the Tanimoto similarity neighborhood, to select a designed subset based on Tanimoto similarity
alone. Additional options allow one to remove from consideration products with a parameter
outside of the desired range (such as molecular weight or logP or price), and to remove all
products whose enumerated fields for one or more reagents are not in a list of acceptable
choices (such as supplier).

The design selection consists of first removing products from consideration based on
range of variables or acceptability of reagent. An initial selection is made, normally by random
selection among all remaining products. Every product whose maximum possible Tanimoto
similarity is above the cutoff is removed from further consideration. A product is then selected
from among all remaining products, either randomly or by rule to continue using one of the
reagents (R1, R2, etc.) so long as possible (so long as any product remains using that reagent).
This selected product's neighbors are removed from further consideration also, and this simple
loop continues until no products remain or a maximum specified number of selections have
been made. The loop is simply: select, remove neighbors in Tanimoto space.

Appendix J: dbclsn_qstop

This program takes the results of the dbcslnprepro program, along with a value to
define the topomeric similarity neighborhood, to select a designed subset based on topomeric
similarity alone.

This program operates exactly like dbcslnQS, except that the step at which neighbors
are removed is based on topomeric similarity computed from the CTOPS fields of the reagents,
rather than the estimate of Tanimoto similarity. Thus after a selection it scans all remaining
products to find every one which has a distance within the similarity radius, and marks these
neighbors as unavailable for further consideration.

(Note that this is equivalent to doing a topomeric similarity search for each selection.
The results are not returned to the user, since their use is to make potential selections
disappear!)
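
The neighbor scan described above might look like the following C sketch. It is offered only
as an illustration under stated assumptions: each product is represented by a hypothetical
vector of field values standing in for descriptors derived from the CTOPS fields, and the
distance is taken as a plain Euclidean distance. How the appended code actually assembles
product distances from reagent CTOPS fields is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Squared Euclidean distance between two field vectors of length dim. */
static double field_dist_sq(const double *a, const double *b, size_t dim) {
    double sum = 0.0;
    for (size_t k = 0; k < dim; k++) {
        double d = a[k] - b[k];
        sum += d * d;
    }
    return sum;
}

/*
 * After product `picked` has been selected, scan all remaining products and
 * mark every one lying within `radius` of it as unavailable.
 * fields holds n vectors of dim values each, stored back to back.
 * Returns the number of products removed from consideration.
 */
static size_t remove_topomeric_neighbors(const double *fields, size_t n,
                                         size_t dim, size_t picked,
                                         double radius,
                                         unsigned char *available) {
    size_t removed = 0;
    assert(picked < n && radius >= 0.0);
    for (size_t j = 0; j < n; j++) {
        if (j == picked || !available[j]) continue;
        /* compare squared values to avoid a square root */
        if (field_dist_sq(fields + picked * dim, fields + j * dim, dim)
                <= radius * radius) {
            available[j] = 0;
            removed++;
        }
    }
    return removed;
}
```

Nothing is returned to the caller except a count, which mirrors the remark above that the
search results exist only to make potential selections disappear.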

Appendix K: dbcsln_des

This program takes the results of the dbcslnprepro program, along with the MBITS and
the Tanimoto similarity neighborhood, plus a value to define the topomeric similarity
neighborhood, to select a designed subset based on Tanimoto similarity and topomeric similarity
acting independently. This corresponds closely to the method of designed subset selection in
the earlier described method. This code now reads and writes master files and bitsets.

This program operates exactly like dbcslnQS, except that in addition to removing every
Tanimoto neighbor of the selected compound, we also remove the topomeric neighbors. Thus
after a selection it scans all remaining products to find every one which has a distance within
the Tanimoto range, removes them, scans all remaining products to find every one which has
a distance within the topomer range, and removes them.

This is equivalent to doing the dbcslnQS and dbclsn_qstop one after another in the
innermost loop where neighbors are identified and removed. By setting either the Tanimoto
or topomer neighborhood radius to be zero, one should be able to achieve the same results as
dbclsn_qstop or dbcslnQS in fact.
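
The two-pass removal just described can be sketched in C as shown below. This is a sketch
under assumptions rather than the appended dbcsln_des source: the distances from the selected
product to every other product are assumed to be precomputed into two hypothetical arrays,
with the Tanimoto distance taken as one minus the Tanimoto similarity. In this sketch a radius
of zero makes a pass remove only products at exactly zero distance, so that pass becomes
nearly inert, which is the behavior the paragraph above relies on.

```c
#include <assert.h>
#include <stddef.h>

/* Mark every available product j (other than picked) with dist[j] <= radius
 * as unavailable; return how many were removed. */
static size_t remove_within(const double *dist, size_t n, size_t picked,
                            double radius, unsigned char *available) {
    size_t removed = 0;
    for (size_t j = 0; j < n; j++) {
        if (j == picked || !available[j]) continue;
        if (dist[j] <= radius) {
            available[j] = 0;
            removed++;
        }
    }
    return removed;
}

/*
 * One inner-loop step with two independent descriptors: first remove the
 * Tanimoto neighbors of the selected product, then its topomeric neighbors.
 * tani_dist[j] and topo_dist[j] are distances from `picked` to product j.
 */
static size_t remove_both(const double *tani_dist, const double *topo_dist,
                          size_t n, size_t picked,
                          double tani_radius, double topo_radius,
                          unsigned char *available) {
    size_t removed;
    assert(picked < n);
    removed = remove_within(tani_dist, n, picked, tani_radius, available);
    removed += remove_within(topo_dist, n, picked, topo_radius, available);
    return removed;
}
```

Because the second pass skips products already marked unavailable, running the passes one
after another removes the union of the two neighborhoods.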

Appendix L: dbcsln_both

This program takes the results of the dbcslnprepro program, along with the MBITS and
a way to scale topomeric distance, plus a similarity cutoff for the combined descriptor of
topomer and Tanimoto, to select a designed subset based on Tanimoto similarity and topomeric
similarity acting as one combined descriptor.

This program operates exactly like dbcslnQS, except that the removal of neighbors is
not based on either Tanimoto or topomeric distance by itself.

This utilizes the new, combined descriptor described earlier. It is not directly equivalent
to either dbcslnQS or dbclsn_qstop in this sense. This code now reads and writes master files
and bitsets.

Appendix M: dbcslntohits

This program takes the index results of dbcslnQS, dbclsn_qstop, dbcsln_both,
dbcsln_des, or dbcslnsim and generates a full product structure SLN hitlist for them. This
hitlist of products is suitable for treatment just as any set of chemical compounds; it loses its
combinatorial identity as it becomes an assembly of independent chemicals. The new version
of this code can now work with bitsets.

Appendix N: CODATA

This is a header file to declare variables.

Appendix O: DB_UTL

This code is a set of subroutines used in many places, and, in particular, by the design
programs.

Appendix P: ELIMATE

This code is a set of subroutines used in many places, and, in particular, by the design
programs.

Appendix Q: FILTER

This code contains subroutines for filtering undesired characteristics from product
molecules.

Appendix R: dbcsln_bitset

This code provides the additional routines needed and called by the other code to handle
bitsets.

Appendix S: trosim

This code performs a topomeric CoMFA search for molecules similar to a query
compound.

Appendix T: topsetup.core

This code performs the fragmentation required to implement a topomeric search of a
query molecule not necessarily derived from a combinatorial synthesis.

From the preceding description of the construction, generation, and searching of a
virtual library, it should be clear that there are many variations which may be employed and,
having taught how to generate and search one specific embodiment, all equivalent embodiments
are considered within the scope of this disclosure.

While the preceding written description is provided as an aid in understanding, it should
be understood that the source code listings appended to this application constitute a complete
disclosure of the best mode currently known to the inventors of the methods of constructing
and searching the virtual library and obtaining selected subsets of molecules with specified
characteristics.

Thus, while this invention has been particularly described with reference to the drug
lead identification art, it is clear that the validation of molecular structural descriptors and their
use in selecting structurally diverse sets of chemical compounds can be applied anywhere a
large number of compounds is encountered from which a representative subset is desired. Since
the implications and advances in the art provided by the methods of this invention are still so
new, the entire range of possible uses for the methods of this invention cannot be fully
described at the present time. However, such as yet unidentified uses are considered to fall under
the teachings and claims of this invention if validated molecular structural descriptors are
employed to characterize the diversity of molecules.

APPENDIX "A"
@expression_generator CHOM_THIS_BUILD_3D
# top level routine for generating topomeric conformer
# CHOM!INIT_BUILD_3D must be called beforehand
# returns true unless something went wrong
#
globalvar CHOM!Align
localvar ma msav rid pat tpat p sln noth zs a1 n capsln \
         polypat patats mpats allpatats
localvar polyats mat1 mat2 schns rbs sybat aneigh ans i \
         mcore jbds tors msln
setvar ma $1
setvar rid $3
setvar capsln $CHOM!Align[ SLN ]
setvar polypat $CHOM!polypat
setvar mcore $CHOM!Align[ MINIT ]
setvar msln $CHOM!Align[ MSLN ]

# fix NO2's (egad what a pain)
setvar pat %search2d( %sln( $ma ) N(=O)O ALL 0 y )
while $pat
   setvar pat %sln_rgroup_sybid( $ma %arg( 1 $pat ) 1 3 )
   modify bond type %bonds( %cat( %arg( 1 $pat ) \
      %arg( 2 $pat ) ) ) 2 >$nulldev
   modify atom type %arg( 2 $pat ) o.2 >$nulldev
   setvar pat %search2d( %sln( $ma ) N(=O)O ALL 0 y )
endwhile
if $CHOM!Align[DEBUG]
   label id *
endif

# basic optimization
switch $2
   case NOBUILD)

   case CONCORD)
      if %not( %chom_concord( $ma ) )
         goto bad_energy
      endif
   ;;
   case MINIMIZE)
      MAXIMIN $ma DONE INTERACTIVE >$nulldev
      if %gt( $maximin2_energy 1000 )
         goto bad_energy
      endif
   ;;
endswitch
setvar CHOM!Align[ RBDS ]

# done, if only 3d coord, but for CoMFA . .
if %streql( $4 "A" )
   # detect (pro)chiral atoms for adjustment, adjusting and
   # removing any of pre-defined chirality
   setvar CHOM!Align[CHIRAL] %set_create( %atoms({chiral(*,RS)}) )

   # find a 2D hit
   setvar pat %search2d( %cat( %sln( $ma ) ) $capsln NoDup 0 y )
   if %not( $pat )
      echo $capsln not found in %sln( $ma ) from Row $rid skipping
      return
   endif
   setvar pat %arg( 1 $pat )

   # now find the (first) pattern that matches the aligning fragment AND whose
   # atoms are contained by this SLN hit
   setvar allpatats %set_create( %sln_rgroup_sybid( \
      $ma $pat %range( 1 %sln_atom_count( $capsln ) ) ) )
   setvar mpats %search2d( %cat( %sln( $ma ) ) $msln NoDup 0 y )
   for pat in $mpats
      if %not( %set_diff( %set_create( %sln_rgroup_sybid( $ma $pat \
            %range( 1 %sln_atom_count( $capsln ) ) ) ) $allpatats ) )
         break
      endif
   endfor
   setvar polyats %set_create( %sln_rgroup_sybid( $ma $pat $polypat ) )

   # allow user supplied routine to adjust initial conformer
   if $CHOM!Align[ FIX_CF_CALLBACK ]
      $CHOM!Align[ FIX_CF_CALLBACK ] $ma $allpatats
   endif

   # collect all atoms for MATCH and
   # and all the info on roots of torsions needing setting
   # (= all bonds to atoms that are
   # polyvalent within the aligning fragment, except bonds that are (1)
   # in rings or (2) connected to some other atom polyvalent within the
   # aligning fragment).
   setvar mat1
   setvar mat2
   setvar schns
   setvar rbds %set_create( %bonds({rings()}) )
   for a in %range( 1 %sln_atom_count( $msln ) )
      setvar mat1 $mat1 $CHOM!patats[ $a ]
      setvar sybat %sln_rgroup_sybid( $ma $pat $a )
      setvar mat2 $mat2 $sybat

      # build torsion root lists
      if %set_and( $sybat "$polyats" )
         setvar aneigh %set_create( %atom_info( $sybat NEIGHBORS ) )
         setvar ans %set_diff( $aneigh $polyats )
         for i in %set_unpack( "$ans" )
            if %eq( %count( %atom_info( $i NEIGHBORS ) ) 1 )
               goto notoroot
            endif
            if $rbds
               if %set_and( $rbds %bonds( %cat( $i $sybat ) ) )
                  goto notoroot
               endif
            endif
            setvar tors %set_diff( $aneigh $i )

            # if there are multiple possible torsional roots, choose
            # one that is part of the root main chain
            if %gt( %set_size( "$tors" ) 1 )
               if %set_and( "$tors" $polyats )
                  setvar tors %set_and( $tors $polyats )
               endif
            endif

            # if there are still multiple choices, just have to pick arbitrarily
            if $tors
               setvar tors %arg( 1 %set_unpack( $tors ) )
               setvar schns $schns %cat( $sybat $tors "," $i )
            endif
notoroot:
         endfor
      endif
   endfor

   setvar dofit MATCH %cat( $mcore %set_create( $mat1 ) ) \
      %cat( $ma %set_create( $mat2 ) )
   $dofit >$nulldev
   if $CHOM!Align[DEBUG]
      echo %prompt( INT 1 " " " " )
   endif

   # do FIT
   if %gt( $MATCH_RMS $CHOM!Align[ FITRMS ] )
      setvar CHOM!BadRows %set_or( "$CHOM!BadRows" $rid )
      echo Bad geometric alignment (MATCH_RMS = $MATCH_RMS) \
         for Row $rid skipping
   endif

   # side chain alignments
   switch $CHOM!Align[ ALICYC ]
      case User_Macro)
         $CHOM!Align[ ALIDATA ] $ma $CHOM!ALIGN[ MCORE ]
      ;;
      case All_trans)
      case With_Templates)
         setvar no_rings TRUE
         for i in $schns
            setvar jbds %set_unpack( $i )
            # can set "side chain" bonds only if connecting bond is not cyclic
            if %set_and( "$rbds" "%bonds( %cat( %arg( 3 $jbds ) \
                  %arg( 1 $jbds ) ) )" )
               setvar no_rings
            else
               CHOM!AllTrans $jbds
            endif
         endfor
         if $CHOM!Align[DEBUG]
            echo %prompt( INT 1 " " " )
         endif
         if %streql( $CHOM!Align[ ALICYC ] With_Templates )
            setvar f %open( $CHOM!Align[ ALIDATA ] "r" )
            setvar buff %read( $f )
            setvar slnma %cat( %sln( $ma ) )
            while $buff
               # each line of text should have pattern, SLN IDs for the 4 torsion atoms,
               # and a torsion value to set
               if %eq( %count( $buff ) 5 )
                  setvar torpat %search2d( $slnma %arg( 1 $buff ) NoDup 0 y )
                  for t in $torpat
                     MODIFY TORSION %sln_rgroup_sybid( $ma $t %arg( 2 $buff ) \
                        %arg( 3 $buff ) %arg( 4 $buff ) ) %arg( 5 $buff ) >$nulldev
                  endfor
               endif
            endwhile
            %close( $f )
         endif
      ;;
   endswitch
endif

# do a bump check?
if $CHOM!Align[BUMPS]
   if %atoms({bumps(*,*)})
      echo Bad steric contacts in aligned conformer for \
         Row $rid skipping
      return
   endif
endif

# partial charges
switch $CHOM!Align[ CHARGE ]
   case None)
   ;;
   case User_Macro)
      exec $CHOM!Align[ CHARGEDATA ] $ma
   ;;
   case )
      CHARGE $ma COMPUTE $CHOM!Align[ CHARGE ] | >$nulldev
   ;;
endswitch
%return( TRUE )
return
bad_energy:
echo Minimization failed - skipping molecule
return




@macro ALLTRANS chom
# assumes default molecule, takes argument atoms $1 and $2
# where $1 is the JOINed atom of the core, $2 is the atom that
# the rest of the substituent is to be trans to,
# and $3 is the JOINed atom of the substituent
# starts from that atom and sets all side chains
# to a trans conformation
# where choices exist, the largest chain is set to trans
# and secondary chains "fall wherever they fall"
# manages chain branchings
# ignores ring bonds
globalvar CHOM!Err CHOM!Align
localvar bds b bdset a1 a2 tmp sbonds sats rbond pbds torsion ringbonds
localvar doit chir cats rgjoined b2set tval
if %and( "$batch" "$CHOM!Err" )
   RETURN
endif

# warn if angles will be ambiguous
# setvar chir %set_create( %atoms({chiral(*,RS)}) )
# check input for legality
setvar tmp %set_create( %atom_info( $1 NEIGHBORS ) )
if %not( %eq( 2 %count( %set_unpack( %set_and( \
      "$tmp" %cat( $2 "," $3 ) ) ) ) ) )
   echo Bad input to ALLTRANS (atoms $2 $3 not bonded to $1)
   return
endif

# save key bonds
setvar rbond %bonds( %cat( $3 $1 ) )
setvar sats %conn_atoms( $3 $1 )




if %not( $sats )
   # echo No substituent atoms found in ALLTRANS
   return
endif
setvar sats $3 $sats
setvar sbonds %set_create( %bonds( %cat( \
   "{TO_ATOMS(" %set_create( $sats ) ")}" ) ) )

# define the other bonds that might need adjusting
setvar bds %set_create( %bonds( (*-{RINGS()})&<1> ) )
setvar bds %set_and( "$sbonds" "$bds" )
if %not( $bds )
   return
endif

# discard bonds to primary atoms
setvar mval %set_create( %atoms( \
   <H>+<O.2>+<F>+<I>+<Cl>+<Br>+<N.1>+<LP>+<Du> ) )
setvar pds %set_create( %bonds( %cat( "{TO_ATOMS(" $mval ")}" ) ) )
setvar bds %set_diff( $bds $pds )
setvar CHOM!Align[ RBDS ] %set_or( $bds "$CHOM!Align[ RBDS ]" )
setvar ringbonds %set_create( %bonds({RINGS()}) )

# walk all the important bonds
for b in %set_unpack( $bds )
   setvar doit TRUE

   # if this is the JOIN bond, already have some info
   if %eq( $b $rbond )
      setvar a0 $2
      setvar a1 $1
      setvar a2 $3
      # still need to be SURE we're not monovalent
      if %or( "%eq( 1 %count( %atom_info( $a1 NEIGHBORS ) ) )" \
              "%eq( 1 %count( %atom_info( $a2 NEIGHBORS ) ) )" )
         setvar doit
      endif
   else
      setvar bdat %bond_info( $b ORIGIN TARGET )
      setvar a1 %arg( 1 $bdat )
      setvar a2 %arg( 2 $bdat )
      if %or( "%eq( 1 %count( %atom_info( $a1 NEIGHBORS ) ) )" \
              "%eq( 1 %count( %atom_info( $a2 NEIGHBORS ) ) )" )
         setvar doit
      endif
      if $doit
         # which end leads to root atom? if necessary flip a1,a2 to make that one be a1
         if %set_and( "%set_create( %conn_atoms( $a2 $a1 ) )" $1 )
            setvar tmp $a1
            setvar a1 $a2
            setvar a2 $tmp
         endif
         setvar a0path %trans_path( $a1 $a2 $1 )
         setvar a0 %arg( 1 $a0path )
      endif
   endif
   if $doit
      setvar a3path %trans_path( $a2 $a1 $CHOM!ALIGN[ attached ] )
      setvar a3 %arg( 1 $a3path )
      setvar b2set %bonds( %cat( $a0 $a1 $a2 $a3 ) )
      setvar rgjoined %set_and( "$ringbonds" "%set_create( $b2set )" )
      setvar nrgjoined %count( %set_unpack( "$rgjoined" ) )
      setvar b2 %arg( 2 $b2set )
      if %eq( 0 $nrgjoined )
         setvar torsion 180
      else
         if %eq( 1 $nrgjoined )
            setvar torsion 90
         else
            setvar torsion 60
         endif
      endif
      modify torsion $a0 $a1 $a2 $a3 $torsion >$nulldev
      if %set_and( "$cats" $a2 )
         MEASURE TORSION %arg( 2 $a3path ) $a1 $a2 $a3 >$nulldev |
         setvar torsion $measure_torsion
         while %lt( $torsion 0 )
            setvar torsion %math( $torsion + 360 )
         endwhile
         if %gt( 180 $torsion )
            CHOM!Reflect $a2 $a1 %arg( 1 $a3path ) \
               %arg( 2 $a3path ) %arg( 3 $a3path )
         endif
      endif
      setvar CHOM!Align[ CHIRAL ] %set_diff( "$CHOM!Align[ CHIRAL ]" $a2 )
   endif
endfor
#.

@macro Reflect CHOM
#====================================================
# does a controlled inversion, to convert prochiral atom to topomeric stereoform
localvar arefl
DEFINE PLANE %cat( $1 $2 $3 ) P1 >$nulldev
setvar arefl $4
setvar arefl %set_or( $arefl "%set_create( %conn_atoms( $4 $1 ) )" )
if $5
   setvar arefl %set_or( $arefl $5 )
   setvar arefl %set_or( $arefl "%set_create( %conn_atoms( $5 $1 ) )" )
endif
REFLECT $arefl P1 >$nulldev
REMOVE PLANE M* P1 >$nulldev
#.

@expression_generator CHOM_CONCORD
# does its best to generate a concord structure for the specified workarea
localvar ma p pat msav noth try
setvar ma $1

# fix indole atom typing problem
setvar pat %search2d( %sln( $ma ) NH(:C):C ALL 0 y )
for p in $pat
   setvar tpat %sln_rgroup_sybid( $ma $p 1 )
   modify atom only $tpat N.ar 1 >$nulldev
endfor

# renumber heavy atoms to avoid other problems
# echo before renumber: %sln($ma)
setvar mrenum %molempty()
renumber $ma $mrenum %atoms( *-<H> ) >$nulldev |
copy $mrenum $ma
zap $mrenum
setvar msav %molempty()
copy $ma $msav
setvar nats %mol_info( $ma NATOMS )
DEFAULT $ma >$nulldev
for try in %range( 1 3 )
   CONCORD M $ma >$nulldev
   # Concord can return bond-less structures! or some different structure or do nothing
   setvar cok TRUE
   if %not( %eq( %mol_info( $ma NATOMS ) $nats ) )
      setvar cok
   endif
   if %eq( 0 %mol_info( $ma NBONDS ) )
      setvar cok
   endif
   if $cok
      setvar noth %arg( 1 %atoms( <H> ) )
      if $noth
         measure distance $noth %atom_info( $noth NEIGHBORS ) >$nulldev |
         setvar cok %gt( $measure_distance 0.9 )
      endif
   endif
   if $cok
      break
   endif
   echo Concord failed try $try
   #echo %prompt( INT 2 )
   copy $msav $ma
endfor

if %not( $cok )
   if %not( $CHOM!Align[ FAST ] )
      echo Concord failed for %sln( $ma ) - minimizing
      copy $msav $ma
      for try in %range( 1 4 )
         MAXIMIN $ma DONE INTERACTIVE
         if %lt( $maximin2_energy 1000 )
            break
         endif
         %file_delete( junk.his ) >$nulldev
         DYNAMICS m1 SETUP junk.his DONE Interval_Length \
            300,0 DONE FINISHED INTERACTIVE
         if %eq( $try 3 )
            zap $msav
            return
         endif
      endfor
   else
      echo Skipping non-Concord structure
      zap $msav
      return
   endif
endif
zap $msav
if $CHOM!Align[ CORE_SLN ]
   # need to find and record other attachment point for trans_path
   # standard aligning group
   setvar args $CHOM!Align[ CORE_SLN ]
   setvar msln %string_insert( %arg( 1 $args ) \
      %arg( 3 $args ) %arg( 2 $args ) )
   setvar msln %string_insert( $msln %arg( 4 $args ) R1 )

   # can't begin SLN with (
   if %eq( 1 %pos( "(CH2" $msln ) )
      setvar msln %cat( "CH2(" %substr( $msln 5 ) )
   endif
   setvar rid %sln_rgroup_slnid( $msln )
   setvar hit %search2d( %sln( $ma ) $msln NoDup 1 y )
   if %not( $hit )
      while %pos( $msln )
         setvar msln %string_insert( $msln " - " )
      endwhile
      setvar hit %search2d( %sln( $ma ) $msln NoDup 1 y )
   endif
   setvar rats %sln_rgroup_sybid( $ma $hit $rid )
   if %not( $rats )
      echo Pattern $msln not found in %sln( $ma ) - missing core attachment
      return
   endif
   for cat in %set_unpack( $rats )
      if %gt( %count( %atom_info( $cat NEIGHBORS ) ) 1 )
         break
      endif
   endfor
   setvar CHOM!Align[ ATTACHED ] %set_create( $cat \
      %set_diff( %set_create( %atom_info( $cat NEIGHBORS ) ) $rats ) )
endif
%return( TRUE )
#.

@macro INIT_BUILD_3D CHOM
# prepare and generate global data about template fragment
globalvar CHOM!patats CHOM!polypat
localvar mcore msln capsln patats ys rat yrat nrat tpat a
# setvar mcore $CHOM!Align[ MCORE ]
if $1
   setvar mcore $1
else
   setvar mcore %molempty()
endif
default $mcore >$nulldev
if $CHOM!Align[DEBUG]
   label id
endif
setvar capsln $CHOM!Align[ SLN ]

# use as is
if $CHOM!Align[ ORIENT ]
   # orient template so that an R points in the positive X direction
   setvar ys %set_unpack( $CHOM!Align[ ORIENT ] )
   setvar rat %arg( 1 $ys )
   setvar nrat %arg( 2 $ys )
   setvar yrat %arg( 3 $ys )
   ORIENT USER $rat $nrat $yrat >$nulldev
endif

# identify all the atoms for FIT,
# Here we identify the SLN IDs of the polyvalent atoms
setvar tpat %arg( 1 %search2d( $capsln $capsln NoDup 0 y ) )
setvar polypat
setvar CHOM!patats
echo %sln_to_mol( $mcore $capsln ) >$nulldev
for a in %range( 1 %sln_atom_count( $capsln ) )
   setvar CHOM!patats[ $a ] %sln_rgroup_sybid( $mcore $tpat $a )
   if %gt( %count( %atom_info( $CHOM!patats[ $a ] NEIGHBORS ) ) 1 )
      setvar polypat $polypat $a
   endif
endfor
if $CHOM!Align[DEBUG]
   echo %prompt( INT 1 " " " " )
endif
copy $CHOM!Align[ MCORE ] $mcore
zap $CHOM!Align[ MCORE ]
setvar msln %sln( $mcore )
setvar CHOM!polypat $polypat
setvar CHOM!Align[ MINIT ] $mcore
setvar CHOM!Align[ MSLN ] $msln
#.

n-c 

/*#module SYB_MGEN_CONN_ATOMS "V1.0"*/
#include <ctype.h>
#include <string.h>
#include <stdio.h>
#include "ta_config.h"
#include "ta_types.h"
#include "utl_mem.h"
#include "uims2.h"
#include "ta_math.h"
#include "utl_geom.h"
#include "utl_str.h"
#include "molecule.h"
#include "utl_list.h"
#include "syb_uims_def.h"
#include "mnu2/macros_proto.h"
#include "syb/expr_p.h"
#include "syb/area_p.h"
#include "syb/atab_p.h"
#include "syb/atom_p.h"
#include "uims2_p.h"
#include "utl_srt.h"
/*E+:SYB_MGEN_CONN_BEST*/
/*
 * int SYB_MGEN_CONN_BEST( identifier, nargs, args, writer )
 *
 * Dick Cramer, Apr. 9, 1995 (written for SELECTOR use)
 *
 * Expression generator that returns the atoms attached to a given
 * atom, excepting the second, in a prioritized order.
 * If there are two arguments, the ordering is by decreasing branch
 * "size", where "size" is first any path with rings encountered, then
 * number of attached atoms, then MW (paths in cycles end when an atom
 * in another path is encountered.)
 * If three arguments, the atom that is returned is the one that
 * begins the shortest path containing either of up
 * to two atoms referred to by the
 * third argument. If multiple such paths, ordering is same as for
 * two arguments.
 * If last argument is DEBUG, all paths are written to stdout.
 *
 * User interface:
 *    %trans_path( a1 a2 ( a3 ) (DEBUG) )
 */

int SYB_MGEN_CONN_BEST( identifier, nargs, args, Writer )
char *identifier;
int nargs;
char *args[];
PFI Writer;
{
# define MAX_NP 8
    struct pathrec {
        int root, nrings, chosen, nats, done;
        float mw;
        set_ptr path;
        atom_ptr a;
    };
    struct pathrec p[MAX_NP];
    int retval, i, np, toroot, a1, a2, a4, a5, a, pnow, pdone, growing,
        final_pos, area_num, new_rings, nats, nuats, elem, ncycles,
        best, debug, ringclosed, p2do;
    List_Ptr atom_exp_list = NIL;
    mol_ptr m1, m2;
    atom_ptr arec1, arec2, arec, a4rec;
    set_ptr atom_set1 = NIL, a2chk = NIL, nuls = NIL, cnats = NIL,
            nxcn = NIL, end_atoms = NIL, scratch = NIL;
    char tempString[256];
    float t1, t2, diff, pot1, pot2, podiff;

    retval = 0;

    /* Check the number of arguments */
    if ( nargs < 2 || nargs > 4 ) {
        UIMS2_WRITE_ERROR(
            "Error: %trans_path requires 2 to 4 arguments\n" );
        return 0;
    }
    np = 0;
    debug = (!UTL_STR_CMP_NOCASE( args[ nargs - 1], "DEBUG" ));
    toroot = (debug && nargs == 4) || (!debug && nargs == 3);
    /* PARSE THE INPUT */
    /* get first atom */
    if (!(atom_exp_list = SYB_EXPR_ANALYZE( SYB_EXPR_GET_ATOM_TOKEN,
                                            args[0],
                                            &final_pos, &area_num )))
        goto error;
    if (!(m1 = SYB_AREA_GET_MOLECULE (area_num)))
        goto cleanup;
    if (!(atom_set1 = SYB_ATOM_FIND_SET (m1, atom_exp_list)))
        goto error;
    if ( atom_exp_list )
        SYB_EXPR_DELETE_RPN_LIST( atom_exp_list );
    atom_exp_list = (List_Ptr) NIL;
    if (!(1 == UTL_SET_CARDINALITY(atom_set1))) {
        UIMS2_WRITE_ERROR(
            "Error: First argument must be only one atom\n");
        goto error;
    }
    if (!(arec1 = SYB_ATOM_FIND_REC (m1, UTL_SET_NEXT (atom_set1, -1)) ))
        goto error;
    a1 = arec1->recno;
    UTL_SET_DESTROY( atom_set1 );
    atom_set1 = NIL;
    /* get 2nd atom */
    if (!(atom_exp_list = SYB_EXPR_ANALYZE( SYB_EXPR_GET_ATOM_TOKEN,
                                            args[1],
                                            &final_pos, &area_num )))
        goto error;
    if (!(m2 = SYB_AREA_GET_MOLECULE (area_num)))
        goto cleanup;
    if (!(end_atoms = SYB_ATOM_FIND_SET ( m2, atom_exp_list )))
        goto error;
    if ( atom_exp_list )
        SYB_EXPR_DELETE_RPN_LIST( atom_exp_list );
    atom_exp_list = (List_Ptr) NIL;
    if (m1 != m2) {
        UIMS2_WRITE_ERROR(
            "Error: atoms must be in the same molecule\n");
        goto error;
    }
    if (!(1 == UTL_SET_CARDINALITY(end_atoms))) {
        UIMS2_WRITE_ERROR(
            "Error: Second argument must be only one atom\n");
        goto error;
    }
    if (!(arec2 = SYB_ATOM_FIND_REC (m1, UTL_SET_NEXT (end_atoms, -1)) ))
        goto error;
    a2 = arec2->recno;
    /* get 3rd atom */
    if (toroot) {
        if (!(atom_exp_list = SYB_EXPR_ANALYZE( SYB_EXPR_GET_ATOM_TOKEN,
                                                args[2],
                                                &final_pos, &area_num )))
            goto error;
        if (!(m2 = SYB_AREA_GET_MOLECULE (area_num)))
            goto cleanup;
        if (!(atom_set1 = SYB_ATOM_FIND_SET ( m2, atom_exp_list )))
            goto error;
        if ( atom_exp_list )
            SYB_EXPR_DELETE_RPN_LIST( atom_exp_list );
        atom_exp_list = (List_Ptr) NIL;
        if (m1 != m2) {
            UIMS2_WRITE_ERROR(
                "Error: atoms must be in the same molecule\n");
            goto error;
        }
        if (2 < UTL_SET_CARDINALITY(atom_set1)) {
            UIMS2_WRITE_ERROR(
                "Error: Second argument must be no more than two atoms\n");
            goto error;
        }
        a4 = a5 = -1;
        elem = UTL_SET_NEXT (atom_set1, -1);
        if (!(arec = SYB_ATOM_FIND_REC (m1, elem) )) goto error;
        a4 = arec->recno;
        if ((elem = UTL_SET_NEXT (atom_set1, elem) ) != -1) {
            if (!(arec = SYB_ATOM_FIND_REC (m1, elem) )) goto error;
            a5 = arec->recno;
        }
        UTL_SET_DESTROY( atom_set1 );
        atom_set1 = NIL;
    }

    /* GENERATE the paths */
    /* set up paths */
    if (!(a2chk = UTL_SET_CREATE( m1->max_atoms + 1 ) )) goto error;
    if (!(nuls = UTL_SET_CREATE( m1->max_atoms + 1 ) )) goto error;
    if (!(cnats = UTL_SET_CREATE( m1->max_atoms + 1 ) )) goto error;
    if (!(nxcn = UTL_SET_CREATE( m1->max_atoms + 1 ) )) goto error;
    if (!(scratch = UTL_SET_CREATE( m1->max_atoms + 1 ) )) goto error;
    if (!syb_mgen_conn_att_atoms( a2chk, m1, a1 )) goto error;
    if (!UTL_SET_MEMBER( a2chk, a2 )) {
        UIMS2_WRITE_ERROR (
            "Error: second argument atom is not bonded to first argument atom\n");
        goto error;
    }
    UTL_SET_DELETE( a2chk, a2 );
    a = -1;
    np = 0;
    while (np < MAX_NP && (a = UTL_SET_NEXT( a2chk, a)) >= 0) {
        if (!(p[np].path = UTL_SET_CREATE( m1->max_atoms + 1 ) )) goto error;
        p[np].root = a;
        p[np].nrings = p[np].done = 0;
        UTL_SET_INSERT( p[np].path, a );
        if (!(p[np].a = SYB_ATOM_FIND_REC (m1, p[np].root) )) goto error;
        np++;
    }

    /* grow the paths */
    growing = TRUE;
    nats = 0;
    ncycles = 0;
    while (growing ) {
        nuats = 0;
        ringclosed = FALSE;
        for (pnow = 0; pnow < np; pnow++ ) if (!p[pnow].done) {
            UTL_SET_COPY_INPLACE( cnats, p[pnow].path );
            UTL_SET_CLEAR( nxcn );
            elem = -1;
            /* accumulate this generation of attached atoms into nxcn */
            while ( (elem = UTL_SET_NEXT( cnats, elem)) >= 0 ) {
                UTL_SET_CLEAR( nuls );
                if (!syb_mgen_conn_att_atoms( nuls, m1, elem )) return( FALSE );
                UTL_SET_DELETE( nuls, a1 );
                UTL_SET_DIFF_INPLACE( nuls, end_atoms, nuls );
                UTL_SET_OR_INPLACE( nxcn, nuls, nxcn );
                UTL_SET_DIFF_INPLACE( nxcn, p[pnow].path, nxcn );
            }
            UTL_SET_OR_INPLACE( p[pnow].path, nxcn, p[pnow].path );
            if (toroot) if ((UTL_SET_MEMBER( p[pnow].path, a4 ))
                || ( a5 > -1 && UTL_SET_MEMBER( p[pnow].path, a5 )))
                p[pnow].done = TRUE;
            /* remove and mark ring closures when growing out */
            if (!toroot) for (pdone = 0; pdone < np; pdone++ ) if (pdone != pnow) {
                UTL_SET_AND_INPLACE( p[pnow].path, p[pdone].path, a2chk );
                if ((new_rings = UTL_SET_CARDINALITY( a2chk ))) {
                    /* we have ring closure(s) */
                    p[pnow].nrings += new_rings;
                    p[pdone].nrings += new_rings;
                    ringclosed = TRUE;
                    UTL_SET_OR_INPLACE( end_atoms, a2chk, end_atoms );
                    /* if pdone < pnow, two branches are now same lengths, drop common
                       atom from both; but if >, branches are different, and must avoid
                       repeated closing */
                    if (pdone < pnow) {
                        /* remove atom(s) in the previous branch because paths are
                           really same length */
                        UTL_SET_DIFF_INPLACE( p[pdone].path, a2chk, p[pdone].path );
                        UTL_SET_DIFF_INPLACE( p[pnow].path, a2chk, p[pnow].path );
                    }
                    else {
                        /* must identify and mark each atom in nxcn that is attached
                           to a2chk atom */
                        elem = -1;
                        while ( (elem = UTL_SET_NEXT( a2chk, elem)) >= 0 ) {
                            UTL_SET_CLEAR( scratch );
                            if (!syb_mgen_conn_att_atoms( scratch, m1, elem ))
                                return( FALSE );
                            UTL_SET_AND_INPLACE( scratch, nxcn, scratch );
                            UTL_SET_OR_INPLACE( end_atoms, scratch, end_atoms );
                        }
                    }
                }
            }
        }
        /* done growing paths if no more atoms added to any path */
        for (pdone = 0, nuats = 0; pdone < np; pdone++ )
            nuats += UTL_SET_CARDINALITY( p[pdone].path );
        if (nuats <= nats && !ringclosed) growing = FALSE;
        nats = nuats;
        /* or after 100 atom layers out regardless */
        ncycles++;
        if (ncycles >= 100) growing = FALSE;
    }

    /* debugging */
    if (debug) for (pdone = 0; pdone < np; pdone++) {
        sprintf( tempString, "Path %d (%d rings, from %d): ",
                 pdone+1, p[pdone].nrings, p[pdone].root );
        UBS_OUTPUT_MESSAGE( stdout, tempString );
        ashow( p[pdone].path, m1 );
    }

    /* compute the path properties */
    for (pdone = 0; pdone < np; pdone++) {
        p[pdone].chosen = toroot && (UTL_SET_MEMBER(p[pdone].path, a4)
            || ( a5 > -1 && UTL_SET_MEMBER( p[pdone].path, a5 )));
        p[pdone].nats = UTL_SET_CARDINALITY( p[pdone].path );
        p[pdone].nrings = p[pdone].nrings ? 1 : 0;
        p[pdone].mw = 0.0;
        p[pdone].done = 0;
    }

    /* return all root atoms, ordered best to worst */
    for (p2do = 0; p2do < np; p2do++ ) {
        for (pdone = 0; pdone < np; pdone++) if (!p[pdone].done) {
            best = pdone;
            break;
        }
        for (pdone = 0; pdone < np; pdone++) if (!p[pdone].done && pdone != best) {
            if (!p[best].chosen && p[pdone].chosen) best = pdone;
            if (p[best].chosen == p[pdone].chosen) {
                if (p[pdone].nrings && !p[best].nrings) best = pdone;
                else if ((!p[best].chosen && (p[pdone].nats > p[best].nats)) ||
                         (p[best].chosen && (p[pdone].nats < p[best].nats))) best = pdone;
                else if (p[pdone].nats == p[best].nats) {
                    p[pdone].mw = get_path_mw( p[pdone].path, m1, p[pdone].mw );
                    p[best].mw = get_path_mw( p[best].path, m1, p[best].mw );
                    if (p[pdone].mw > p[best].mw) best = pdone;
                    else if (p[pdone].mw == p[best].mw) {
                        /* checking relative geometries of attachments via "improper" torsion */
                        /* the phenyl ether problem - if candidates are 180 degrees apart and
                           we are on the root side of the torsion, pick the atom to the
                           "right", not the "left", of the main chain */
                        if (toroot) {
                            /* are we 180 apart? */
                            if (!( a4rec = SYB_ATOM_FIND_REC (m1, a4 )) ) goto error;
                            pot1 = UTL_GEOM_TAU ( a4rec->xyz, arec1->xyz, arec2->xyz,
                                                  p[best].a->xyz );
                            pot2 = UTL_GEOM_TAU ( a4rec->xyz, arec1->xyz, arec2->xyz,
                                                  p[pdone].a->xyz );
                            podiff = pot1 - pot2;
                            while (podiff < 0.0) podiff += 360.0;
                            while (pot2 < 0.0) pot2 += 360.0;
                            if (podiff < 190.0 && podiff > 170.0 && pot2 < 180.0)
                                best = pdone;
                        }
                        if (best != pdone) {
                            /* if not already set, according to the previous special case, then */
                            /* if torsions differ by 360 degrees then we have trans, prefer the +180 */
                            t1 = UTL_GEOM_TAU ( p[pdone].a->xyz, arec1->xyz, arec2->xyz,
                                                p[best].a->xyz );
                            t2 = UTL_GEOM_TAU ( p[best].a->xyz, arec1->xyz, arec2->xyz,
                                                p[pdone].a->xyz );
                            diff = t1 - t2;
                            if (diff > 355.0) best = pdone;
                            else if (diff > -355.0) {
                                while (t1 < 0.0) t1 += 360.0;
                                if (t1 > 170.0 && t1 <= 350.0) best = pdone;
                            }
                        }
                    }
                }
            }
        }
        arec = SYB_ATOM_FIND_REC( m1, p[best].root );
        sprintf( tempString, "%d", arec->id );
        if (!(*Writer)(tempString)) goto error;
        p[best].done = TRUE;
    }

    retval = TRUE;
error:
cleanup:
    if ( atom_exp_list )
        SYB_EXPR_DELETE_RPN_LIST( atom_exp_list );
    if ( atom_set1 )
        UTL_SET_DESTROY( atom_set1 );
    if ( end_atoms )
        UTL_SET_DESTROY( end_atoms );
    if ( a2chk )
        UTL_SET_DESTROY( a2chk );
    if ( nuls )
        UTL_SET_DESTROY( nuls );
    if ( nxcn )
        UTL_SET_DESTROY( nxcn );
    if ( cnats )
        UTL_SET_DESTROY( cnats );
    if ( scratch )
        UTL_SET_DESTROY( scratch );
    return( retval );
}

static int syb_mgen_conn_att_atoms( aset, m, atid )
/* ors atoms attached to atid into aset */
/* WORKS STRICTLY WITH RECNOS */
set_ptr aset;
mol_ptr m;
int atid;
{
    atom_ptr at;
    List_Ptr tohs;
    atom_ptr toh;
    conn_ptr conn1;
    unsigned nbytes1;

    at = SYB_ATOM_FIND_REC( m, atid );
    tohs = at->conn_atom;
    while (tohs) {
        tohs = UTL_LIST_RETRIEVE_P( tohs, &conn1, &nbytes1 );
        toh = SYB_ATOM_FIND_REC( m, conn1->target );
        UTL_SET_INSERT( aset, toh->recno );
    }
    return( TRUE );
}

static float get_path_mw( aset, m, mw )
/* returns the total atomic weight of all atoms in aset */
set_ptr aset;
mol_ptr m;
float mw;
{
    int elem = -1;
    float ans = 0.0;
    atom_ptr at;

    if (mw) return( mw );
    elem = -1;

    while ( (elem = UTL_SET_NEXT( aset, elem )) >= 0 ) {
        at = SYB_ATOM_FIND_REC( m, elem );
        ans += (float) SYB_ATAB_ATOMIC_WEIGHT( at->type );
    }

    return( ans );
}

static void ashow( aset, m )
/* for interactive debugging, shows a set's membership in terms of atom ID */
set_ptr aset;
mol_ptr m;
{
    char buff[1000], *b;
    atom_ptr at;
    int elem;

    *buff = '\0';
    b = buff;
    elem = -1;

    while ( (elem = UTL_SET_NEXT( aset, elem )) >= 0 ) {
        at = SYB_ATOM_FIND_REC( m, elem );
        sprintf( b, " %d", at->id );
        b = buff + strlen( buff );
    }

    sprintf( b, "\n" );
    UBS_OUTPUT_MESSAGE( stdout, buff );
}

/* BEGINNING OF SUBROUTINES I-D. Calculation of attenuated fields */
/* : QSAR_FIELD_EVAL_RB_ATTEN( ) */

/* int QSAR_FIELD_EVAL_RB_ATTEN( molp, stfldp, elfldp, regp, no_st, no_el, ctp ) */

/* Dick Cramer May 13, 1995 */

/*
   Standard CoMFA, except that the contribution of any atom
   to the field falls off with an inverse power of its distance
   from a root atom, measured in NUMBER OF ROTATABLE BONDS!

   This means also that each individual atom's contribution
   has a similarly scaled upper bound, rather than checking
   the upper bound only for the sum over all atoms.
*/

/* This procedure computes vdW 6-12 steric values at each point in a region */
/* and the electrostatic interactions (initially assuming 1/r dielectric). */
/* NOTE:: initially ignoring space averaging, other user knobs. */
/* note:: assuming valid input here; error checking higher up! */

/* Input: */
/* molp   - molecule pointer, molecule to place in region. */
/* stfldp - steric field pointer, where values will be placed. */
/* elfldp - electrostatic field pointer, where values will be placed. */
/* regp   - region pointer, locations where values are to be evaluated. */
/* no_st  - flag to skip steric evaluations */
/* no_el  - flag to skip electrostatic evaluations */
/* ctp    - ComfaTopPtr, for dummy/lp values */

/* Returns 0 on failure, 1 otherwise. */

/* +E : QSAR_FIELD_EVAL_RB_ATTEN( ) */

int QSAR_FIELD_EVAL_RB_ATTEN( molp, stfldp, elfldp, regp, no_st, no_el, ctp )
mol_ptr molp;
FieldPtr stfldp, elfldp;
RegionPtr regp;
int no_st, no_el;
ComfaTopPtr ctp;
{
    BoxPtr box;
    atom_ptr at, SYB_ATOM_FIND_ID();
    int Did, b, ix, iy, iz, nat, vol_avg, repulsive;
    fpt *steric, *elect, SYB_ATAB_VDW_RADII();
    fpt diff, dis, dis2, x, y, z, sum_steric, sum_elect;
    fpt dis6, dis12, repuls_val, offs[9][3], atm_ste, atm_ele;
    fpt *charge, *ctemp, *coord, *ftemp, *wt, scale_vol_avg, atm_steric, atm_elect;
    int *atyp, *itemp, dohbd, dohba, ishbd, retval, dielectric, off, atid;
    static fpt hbond_scal;
    fpt hbond_A, hbond_B, *AtWts = NIL, *QSAR_FIELD_RB_WTS();
    int *HAs, *HDs, *HAp, *HDp; /* sets would be more efficient but slower */
    int do_steric, do_elect;
    set_ptr hdonor, SYB_HBOND_DONORS(), pset = NIL, aset = NIL;

#define Q2KC 332.0
#define MIN_SQ_DISTANCE 1.0e-4

    /* any atom within 10^-2 Angstroms is hereby zapped!
       this is about it: 10^6 / 10^-24 is close to overflow! */

    ftemp = NIL; ctemp = NIL; itemp = NIL; retval = FALSE; HAs = NIL; HDs = NIL;
    hdonor = NIL;

/* for now, make root atom the one closest to 0,0,0 */
for (nat = 1; nat <= molp->natoms; nat++) {
    at = SYB_ATOM_FIND_ID( molp, nat );
    dis2 = at->xyz[0] * at->xyz[0] + at->xyz[1] * at->xyz[1] +
           at->xyz[2] * at->xyz[2];
    if (nat == 1 || dis2 < dis) {
        dis = dis2;
        atid = nat;
    }
}

/* following is specific to topomeric fields */
if (!(AtWts = QSAR_FIELD_RB_WTS( molp, atid ))) goto cleanup;

if (!no_el)
  { dielectric = elfldp->dielectric;
    vol_avg = elfldp->vol_avg_type;
    scale_vol_avg = elfldp->scale_vol_avg;
    repulsive = elfldp->repulsive;
    repuls_val = repexp[repulsive]; elect = elfldp->field_value; }
if (!no_st)
  { vol_avg = stfldp->vol_avg_type;
    scale_vol_avg = stfldp->scale_vol_avg;
    repulsive = stfldp->repulsive;
    repuls_val = repexp[repulsive]; steric = stfldp->field_value; }

if (!(ftemp = (fpt *) UTL_MEM_ALLOC( 3*sizeof(fpt)*molp->natoms ))) goto cleanup;
if (!(ctemp = (fpt *) UTL_MEM_ALLOC( sizeof(fpt)*molp->natoms ))) goto cleanup;
if (!(itemp = (int *) UTL_MEM_ALLOC( sizeof(int)*molp->natoms ))) goto cleanup;
if (!(HAs = (int *) UTL_MEM_ALLOC( sizeof(int)*molp->natoms ))) goto cleanup;
if (!(HDs = (int *) UTL_MEM_ALLOC( sizeof(int)*molp->natoms ))) goto cleanup;
/* get just those H's which are capable of Hbonding */
if (!(hdonor = SYB_HBOND_DONORS( molp, NIL ))) goto cleanup;

for (coord=ftemp, atyp=itemp, charge=ctemp, HAp=HAs, HDp=HDs, nat=1;
     nat<=molp->natoms; nat++)
  { if (NIL == (at = SYB_ATOM_FIND_ID( molp, nat ))) goto cleanup;
    *coord++ = at->xyz[0];
    *coord++ = at->xyz[1];
    *coord++ = at->xyz[2];
    *atyp++ = at->type - 1;
    *charge++ = at->charge;
    *HAp++ = SYB_ATAB_HBOND_ACCEPT( at->type );
    *HDp++ = UTL_SET_MEMBER( hdonor, at->recno );
  }

for (b=0; b<regp->n_boxes; b++) {
    box = &regp->box_array[b];
    dohbd = (SYB_ATAB_ATOMIC_NUMBER( box->atom_type ) == 1) &&
            (box->pt_charge == 1.0);
    dohba = (SYB_ATAB_ATOMIC_NUMBER( box->atom_type ) == 8) &&
            (box->pt_charge == -1.0);
    if (dohbd || dohba) {
        if (!TAILOR_STORE_IT_HERE( "TAILOR!FORCE_FIELD!HBOND_RAD_SCALING",
                                   &hbond_scal, 1 )) goto cleanup;
        hbond_A = pow( *hbond_scal, 6.0 );
        hbond_B = hbond_A * hbond_A;
    }
    if (vol_avg)
        QSAR_FIELD_EVAL_GETOFF( offs, box->stepsize, vol_avg, scale_vol_avg );
    if (!no_st)
        QSAR_FIELD_VDWTAB( box->atom_type, repuls_val, ctp->du_lp_steric );
    for (iz=0, z=box->lo[2]; iz < box->nstep[2]; iz++, z += box->stepsize[2])

nat < molp->natoms;
nat++, wt++)

^xf ( ( *atyp DUMMY-l II *atyp - LP-l ) ' ^^PT^^^rr^lPYn^tf * 

*charge = 0.0; /* set charge to 0 since ignoring Du/lp */
if (!vol_avg) /* the "normal" case */
{
dis2 = x - *coord++;
dis2 *= dis2;
diff = y - *coord++;
diff *= diff;
dis2 += diff;
diff = z - *coord++;
diff *= diff;
dis2 += diff;

if ( !no_el && elfldp->zap_el==2 && do_elect ) {
dis = sqrt( dis2 );
if ( dis < SYB_ATAB_VDW_RADII( *atyp + 1 ) ) {
/* no shortcircuits! */
/*
*elect++ = 0.0;
do_elect = FALSE;
*/

if ( dis2 < MIN_SQ_DISTANCE ) {
/* if atom has no steric value, we don't care about
   MIN_SQ_DISTANCE since it has no contribution anyway */
if ( vdw_a[*atyp] != 0.0 && vdw_b[*atyp] != 0.0 ) {
/* set sterics to its max value at current grid pt. */
atm_steric = (*wt) * stfldp->max_value;

if ( !no_el && do_elect ) {
if ( !no_st && !do_steric && elfldp->zap_el ) {
*elect++ = DAB_F_MISSING;

else if ( *charge != 0.0 ) {
if ( *charge > 0.0 )
atm_elect = (*wt) * elfldp->max_value;
else atm_elect = (*wt) * -elfldp->max_value;

if ( !do_elect && !do_steric )
break; /* break out of loop since neither el. or st.
          need to be calculated for this grid point */

/* setting dis2 to 1 (an arbitrary no.), will prevent a zero
   divide in the sum_steric or sum_elect calculations below */
dis2 = 1.0;
}

if ( !no_st && do_steric ) {
dis6 = dis2 * dis2 * dis2;
dis12 = dis6 * dis6;
dis12 = (repulsive == 1) ? dis12 / dis2 : dis12 / dis2 / dis2;
if (dohbd && *HAp)
atm_steric = hbond_B * vdw_b[*atyp]/dis12 -
             hbond_A * vdw_a[*atyp]/dis6;
else if (dohba && *HDp)
atm_steric = hbond_B * vdw_b[*atyp]/dis12 -
             hbond_A * vdw_a[*atyp]/dis6;
else
atm_steric = vdw_b[*atyp]/dis12 - vdw_a[*atyp]/dis6;
atm_steric = atm_steric > stfldp->max_value ? stfldp->max_value
                                            : atm_steric;
atm_steric *= (*wt);
if ( !no_el && do_elect ) {

acm.eiecc = *<=^Yilleizric 7 sqrt(dis2) : dis2 ) ; 

atm_elect = atm_elect > elfldp->max_value ? elfldp->max_value
                                          : atm_elect;
atm_elect = atm_elect < -(elfldp->max_value) ? -(elfldp->max_value)
                                             : atm_elect;
atm_elect *= (*wt);
sum_elect += atm_elect;

}
atyp++;

sum_steric += atm_steric;

}

else 

{ for (of f=0;off <9;of f ++)" 

i 

coord += 3;
atyp++;
charge++;
HAp++;
HDp++;

} /* atom loop */
doneatoms:

if ( do_steric || do_elect ) {

if (vol_avg) { sum_elect /= 9.0; sum_steric /= 9.0; }
if ( !no_el && do_elect )

else if ( *elect < -elfldp->max_value ) *elect =
-elfldp->max_value;
transform_field( elfldp->max_value, elect, ctp );

elect++;

if ( !no_st && do_steric )
  { *steric = sum_steric;
    if ( *steric > stfldp->max_value )
      { *steric = stfldp->max_value;
        if ( !no_el && elfldp->zap_el == 1 ) *(elect-1) = DAB_F_MISSING; }
    transform_field( stfldp->max_value, steric, ctp );
    steric++; }

} /* points in box loop */
} /* boxes loop */

retval = TRUE;
cleanup:

if (itemp) UTL_MEM_FREE( itemp );
if (ftemp) UTL_MEM_FREE( ftemp );
if (ctemp) UTL_MEM_FREE( ctemp );
if (HAs) UTL_MEM_FREE( HAs );
if (HDs) UTL_MEM_FREE( HDs );
if (hdonor) UTL_SET_DESTROY( hdonor );
if (AtWts) UTL_MEM_FREE( AtWts );
if (pset) UTL_MEM_FREE( pset );
if (aset) UTL_MEM_FREE( aset );
return retval;

#undef Q2KC
#undef MIN_SQ_DISTANCE
}
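
The per-atom clipping and attenuation described in the header of QSAR_FIELD_EVAL_RB_ATTEN can be illustrated in isolation. The following is a minimal sketch, not the SYBYL routine: `AtomTerm`, `atom_steric`, and `attenuated_steric` are hypothetical names, hydrogen-bond scaling and volume averaging are left out, and the 6-12 form B/r^12 - A/r^6 is assumed from the `vdw_b` and `vdw_a` terms in the listing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-atom inputs; none of these names come from SYBYL. */
typedef struct {
    double dis2;   /* squared distance from the grid point to the atom */
    double vdw_a;  /* attractive 6-12 coefficient (A) */
    double vdw_b;  /* repulsive 6-12 coefficient (B) */
    double wt;     /* rotatable-bond attenuation weight, 0 < wt <= 1 */
} AtomTerm;

/* One atom's steric contribution: B/r^12 - A/r^6, clipped at max_value,
   then scaled by the atom's weight. Because clipping is done per atom,
   the effective upper bound for each atom is wt * max_value. */
static double atom_steric(const AtomTerm *t, double max_value)
{
    double dis6, dis12, e;

    assert(t->dis2 > 0.0);
    dis6 = t->dis2 * t->dis2 * t->dis2;
    dis12 = dis6 * dis6;
    e = t->vdw_b / dis12 - t->vdw_a / dis6;
    if (e > max_value) e = max_value;
    return e * t->wt;
}

/* Sum of the individually clipped, attenuated contributions at one point. */
static double attenuated_steric(const AtomTerm *atoms, size_t n, double max_value)
{
    double sum = 0.0;
    size_t i;

    for (i = 0; i < n; i++)
        sum += atom_steric(&atoms[i], max_value);
    return sum;
}
```

With a cutoff of 30, an atom of weight 0.5 sitting almost on top of the grid point contributes at most 15, however large its raw repulsion is; this is the "similarly scaled upper bound" the header comment refers to.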

static fpt *QSAR_FIELD_RB_WTS( molp, rootid )
/* generates rotational-bond wts for each atom */
mol_ptr molp;
int rootid;
{
/* pseudo code for FIELD_RB_WTS( )

   while saw new atoms
       uncover atoms that stopped last shell growth
       grow next "rotational shell"
       while adding to shell
           for each atom in shell
               get neighbors not seen
               for each neighbor
                   if bond is rotatable (acyclic, >1 attached atom, not a,am,ti)
                       cover all other atoms attached to atom for this shell
                   add it to shell
*/

    fpt *ansr = NIL, *vals = NIL, factor, nowfact = 1.0;
    int found, aggcount, atid, aggid, loop, size;
    set_ptr aggats = NIL, allats = NIL, nuls = NIL, endatms = NIL, end_cands = NIL;
    atom_ptr root, SYB_ATOM_FIND_REC(), at, atrec;
    bond_ptr b, SYB_BOND_FIND_REC();
    List_Ptr toats, UTL_LIST_RETRIEVE_P();
    acon_ptr cptr;
    char tempString[200];
    void ashow(), qsar_field_attached_atoms();

    if (!( vals = (fpt *) UTL_MEM_ALLOC( sizeof(fpt)*molp->natoms ) )) return( NIL );
    if ( !UIMS2_VAR_GET_TOKEN( "TAILOR!COMFA!AGGREG_DESCALE",
                               &factor ) ) return( NIL );
    if (!( allats = UTL_SET_CREATE( molp->max_atoms + 1 ) )) goto cleanup;
    if (!( aggats = UTL_SET_CREATE( molp->max_atoms + 1 ) )) goto cleanup;
    if (!( nuls = UTL_SET_CREATE( molp->max_atoms + 1 ) )) goto cleanup;
    if (!( endatms = UTL_SET_CREATE( molp->max_atoms + 1 ) )) goto cleanup;
    if (!( end_cands = UTL_SET_CREATE( molp->max_atoms + 1 ) )) goto cleanup;
    if (!( root = SYB_ATOM_FIND_REC( molp, rootid ) )) goto cleanup;
    UTL_SET_INSERT( aggats, root->recno );
    UTL_SET_INSERT( allats, root->recno );
    aggcount = loop = 1;


    while (TRUE) {

        while (TRUE) {

            while ((aggid = UTL_SET_NEXT( allats, aggid )) >= 0 ) {
                UTL_SET_CLEAR( nuls );
                qsar_field_attached_atoms( nuls, molp, aggid );
                UTL_SET_DIFF_INPLACE( nuls, allats, nuls );
                UTL_SET_DIFF_INPLACE( nuls, endatms, nuls );
                /* identifying any atoms that terminate this aggregate */
                while ((atid = UTL_SET_NEXT( nuls, atid )) >= 0 ) {
                    if (!( at = SYB_ATOM_FIND_REC( molp, atid ) )) goto cleanup;
                    /* skipping monovalent atoms */
                    if (at->nbond > 1) {
                        /* find bond record that attaches to aggid */
                        toats = at->conn_atom;
                        found = FALSE;
                        while (toats && !found) {
                            toats = UTL_LIST_RETRIEVE_P( toats, &cptr, &size );
                            found = (cptr->target == aggid);
                        }
                        if (!found) goto cleanup;
                        b = SYB_BOND_FIND_REC( molp, cptr->bond_rec );
                        if ( !(b->status & BOND_V_IRING) && !(b->status & BOND_V_ERING)
                             && (b->type == SYB_BTAB_MNEM_TO_TYPE( "1" )) ) {
                            /* have an end-of-aggregate atom, mark as end atoms all other attached atoms */
                            UTL_SET_CLEAR( end_cands );
                            qsar_field_attached_atoms( end_cands, molp, at->recno );
                            UTL_SET_DELETE( end_cands, aggid );
                            UTL_SET_OR_INPLACE( endatms, end_cands, endatms );
                        }
                    }
                }
                UTL_SET_OR_INPLACE( aggats, nuls, aggats );
            }
            if (UTL_SET_CARDINALITY( aggats ) <= aggcount ) break;
            aggcount = UTL_SET_CARDINALITY( aggats );
            UTL_SET_OR_INPLACE( allats, aggats, allats );
        }

        /* debugging stuff .. */
        /*
        sprintf( tempString, "Aggregate %d (weight = %f)", loop, nowfact );
        UBS_OUTPUT_MESSAGE( stdout, tempString );
        ashow( aggats, molp );
        */

        /* if no atoms added, we are done! */
        if (UTL_SET_EMPTY( aggats )) break;
        /* record scaling factor for atoms in this aggregate */
        atid = -1;
        while ((atid = UTL_SET_NEXT( aggats, atid )) >= 0 ) {
            if (!(atrec = SYB_ATOM_FIND_REC( molp, atid ))) goto cleanup;
            vals[ (atrec->id)-1 ] = nowfact;
        }
        UTL_SET_OR_INPLACE( allats, aggats, allats );
        UTL_SET_CLEAR( aggats );
        UTL_SET_CLEAR( endatms );
        aggcount = 0;
        nowfact *= factor;
        loop++;
    }
    ansr = vals;

cleanup:
    if (aggats) UTL_SET_DESTROY( aggats );
    if (allats) UTL_SET_DESTROY( allats );
    if (endatms) UTL_SET_DESTROY( endatms );
    if (end_cands) UTL_SET_DESTROY( end_cands );
    if (nuls) UTL_SET_DESTROY( nuls );
    return( ansr );
}
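
The idea behind the rotational-shell weights can be sketched without the SYBYL set and list utilities. The code below is a simplified illustration under stated assumptions, not the routine above: `Bond`, `rb_weights`, and `MAX_ATOMS` are hypothetical, the `rotatable` flag is supplied by the caller (the listing derives it from ring status, bond type, and atom valence), and the "cover all other atoms attached" step is replaced by a plain count of rotatable bonds on the cheapest path from the root. Each atom then receives `factor` raised to that count, which mirrors `nowfact *= factor` once per shell.

```c
#include <assert.h>
#include <limits.h>

#define MAX_ATOMS 64

/* Hypothetical bond record: two atom indices and a rotatable flag. */
typedef struct { int a, b, rotatable; } Bond;

/* Fills wts[i] with factor^k, where k is the smallest number of rotatable
   bonds on any path from root to atom i. Returns 0 on bad input. */
static int rb_weights(int natoms, const Bond *bonds, int nbonds,
                      int root, double factor, double *wts)
{
    int k[MAX_ATOMS];
    int i, changed;

    if (natoms < 1 || natoms > MAX_ATOMS || root < 0 || root >= natoms)
        return 0;
    for (i = 0; i < natoms; i++) k[i] = INT_MAX;
    k[root] = 0;

    /* Relax every bond until no count improves (fine for small molecules). */
    do {
        changed = 0;
        for (i = 0; i < nbonds; i++) {
            int a = bonds[i].a, b = bonds[i].b;
            int cost = bonds[i].rotatable ? 1 : 0;
            if (k[a] != INT_MAX && k[a] + cost < k[b]) { k[b] = k[a] + cost; changed = 1; }
            if (k[b] != INT_MAX && k[b] + cost < k[a]) { k[a] = k[b] + cost; changed = 1; }
        }
    } while (changed);

    for (i = 0; i < natoms; i++) {
        int s;
        double w = 1.0;
        assert(k[i] != INT_MAX);   /* the molecule is assumed to be connected */
        for (s = 0; s < k[i]; s++) w *= factor;
        wts[i] = w;
    }
    return 1;
}
```

For a five-atom chain in which the second and fourth bonds are rotatable and `factor` is 0.5, the weights come out as 1, 1, 0.5, 0.5, 0.25: atoms farther from the root, counted in rotatable bonds, contribute less to the field.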



scatic void qsar_field_accached_a coins ( asec, m, acid ) 

/* ors acoms actached to acm inco asec */ 

/♦ WORKS STRUCTLY WITH RECNOS */ 

secjocr asec; 

mol_pcr m; 

inc acid; 

{ 

acom_pcr ac, SYB ATC»1_FIND_ID < ) ; 
Lisc_Ptr cohs, UTL_LIST RETRIEVE_P ( ) ; 
acom_j)cr coh, SYB_ATOM_FIND_REC ( ) ; 
acon_pcr connl; 
inc nbycfisX; 

ac = SYB_ATOM_FIND_REC ( m, acid ) ; 

coiis x= ac->conn_acom; 

whiie (cohs) { 

cons - UTL_LIST_RETRIEVE_P( cohs, &connl. &nbycesl) : 
coh « SYB_ATOM_FIND_REC ( connl - >cargec ) • 

^ UTL_SET_INSERT( asec, coh->recno ); 

recurn ; 

} 



static void ashow( aset, m )
/* for interactive debugging, shows a set's membership in terms of atom ID */
set_ptr aset;
mol_ptr m;
{
    char buff[1000], *b;
    atom_ptr at, SYB_ATOM_FIND_REC();
    int elem;

    *buff = '\0';
    b = buff;
    elem = -1;

    while ( (elem = UTL_SET_NEXT( aset, elem )) >= 0 ) {
        at = SYB_ATOM_FIND_REC( m, elem );
        sprintf( b, " %d", at->id );
        b = buff + strlen( buff );
    }
    sprintf( b, "\n" );
    UBS_OUTPUT_MESSAGE( stdout, buff );
}

# "best" triangle, e.g.

@expression_generator LRT_FAST

# %lrt_fast rows descriptor_cols bio_col [plus flags like scaling in quotes]
# descriptor_cols -- which columns are the neighborhood
# bio_col -- which column has the bio
# [...] - if need to SCAL NONE or anything like that, do it here
#
# returns a line of the form
# 3.09691 / 0.000546509 = 5666.71 - 496 : 496 :: 15.6981 : 15.6989
# ^ max bio difference
#           ^ optimal distance division for max bio
#                         ^ slope
#                                   ^ number in the lrt
#                                         ^ total number
#                                                ^ area in the lrt
#                                                          ^ total area
# Significance is related to whether ratio of numbers is
# much above ratio of areas.

globalvar SAMPLS_IN_PROGRESS DONE_CHECKED_OUT
localvar hold distname rows cols bio

setvar rows %promptif( "$1" ROW_EXP "Rows to use" )
setvar cols %promptif( "$2" COL_EXP "COMFA*" "Columns of mol descriptors" )
setvar bio %promptif( "$3" COL_EXP "LOGBIO" "Column of bio data" )

setvar hold SAMPLS_IN_PROGRESS
setvar SAMPLS_IN_PROGRESS $bio

setvar distname TAILOR!HIER!DIST_FNAME
setvar TAILOR!HIER!DIST_FNAME lrt_fort.3

# here the information is computed and written to a file
# whose name is passed in via a TAILOR value
QSAR ANA DO | >$NULLDEV $rows $cols HIER 54 |

setvar SAMPLS_IN_PROGRESS $hold
setvar TAILOR!HIER!DIST_FNAME $distname

# contents of the file are returned to the caller
setvar hold %system( "cat lrt_fort.3" )
%return( "$hold" )



#
# Section II-B. SPL script for computing the significance of the distribution
# found by lrt_fast
#

@expression_generator dochi

# computes the chi-square statistic for the number of points below
# the diagonal, null hypothesis being the area fraction of the total.
#
# To be called as: %dochi( %lrt_fast( ) ), i.e., its inputs
# are exactly the output of %lrt_fast as described in the lrt_fast header
#

setvar expected %math( $9 * $11 / $13 )
setvar sq %math( $7 - $expected )
setvar sq %math( $sq * $sq / $expected )
%return( $sq )
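
The arithmetic in dochi can be restated as a small C function. This is an illustrative sketch only: `lrt_chi_square` is a hypothetical name, and the mapping of its arguments to the positional tokens ($7 = number in the lrt, $9 = total number, $11 = area in the lrt, $13 = total area) assumes the %lrt_fast output line is split on whitespace as in the example in the lrt_fast header.

```c
#include <assert.h>

/* Chi-square term for the lower-right triangle:
   (observed - expected)^2 / expected, where the expected count is the
   total count times the triangle's share of the total area. */
static double lrt_chi_square(double n_in_lrt, double n_total,
                             double area_lrt, double area_total)
{
    double expected, d;

    assert(n_total > 0.0 && area_lrt > 0.0 && area_total > 0.0);
    expected = n_total * area_lrt / area_total;
    d = n_in_lrt - expected;
    return d * d / expected;
}
```

If 30 of 40 points fall in a triangle that covers half of the total area, the expected count is 20 and the statistic is (30 - 20)^2 / 20 = 5.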



/* Section II-C. Computes the best diagonal in the "virtual graph" of biological
   distances vs property differences. */

int QSHELL_HIER_LRT( table, biocol, dmat, nrow, order, lmsg )
char *table;
int biocol,  /* column in MSS with biological data */
    nrow,    /* dimension of dmat and order */
    *order;  /* array of row IDs to consider */
fpt *dmat;   /* distance matrix for property distances */
char *lmsg;  /* file name for results */
{
    fpt *p, *q, fabs(), bmax;
    int i, j, count, status_array;
    char *fpt_colname;
    FILE *out, *UTL_FILE_FOPEN();

    /* need to get the bio values
       In the n^2 we can repack into n(n-1)/2 then add the n bio values
       and finish with the bio distances */

    /*
       No error handling. Better be data in those rows!
    */

    for (count=0, i=0; i<nrow; i++)
        for (j=0; j<i; j++)
            dmat[count++] = dmat[i*nrow + j];

    q = p = dmat + ((nrow-1) * nrow) / 2;

    TBL_ACCESS_INDEX_TO_COLNAME( table, biocol-1, &fpt_colname );
    TBL_GRAB_INIT_FPTS( table, 1, &fpt_colname );
    for (i=0; i<nrow; i++, p++)
        TBL_GRAB_GET_FPTS_INV( order[i]-1, &status_array, p );
    TBL_GRAB_COMPLETE_FPTS();

    bmax = 0.0;

    for (count=0, i=0; i<nrow; i++)
        for (j=0; j<i; j++, count++)
            if ( (p[count] = fabs( q[i] - q[j] )) > bmax ) bmax = p[count];

    out = UTL_FILE_FOPEN( lmsg, "w" );
    QSHELL_HIER_DO_LRT( out, count, dmat, p, bmax );

    UTL_FILE_FCLOSE( out );
}

FILE *out;
fpt *xsort, *ysort, bmax;
int index;
{
int *order, count, j, i, bad;
int bestN, besti;
fpt den, bestDen;

«define CUTOFF ( bnax - ( xsorc (order ( i] j / xsortjorder (j] ] ) ) 

Jor';i°o"fniaci^?L./;^j5^^ index .si.eof.int ,„, re cum o 
bestN = besti = bad = 0;
bestDen = 0.0;

fpt_heapsort( index, xsort, order );
for (j«0;counc«0. bad-0, j<index 

if (xsorc [order (j) 3 o.O) continue; 

for ii«0;i<=j;i^.^) 
1 

iise''^°" (orderdJ ] <- CUTOFF) count**; 
I /* loop over all d <= this dist^ce*'' ./ 

den = bmax * xsort (order (index- 111 • 
sprint£(msg..%g / *g . ,g . ^ 

torder (bestl] ] . bmax/xlort (orderfbestH 1 

uBs_ooTPu?"^?s^^TSurs3sr°" '"^"^^ ' 'LsiTi^] : 

UTL_MEM_FREE( order );
return 1;

}






/* n is number of elements
   arrin is array of floats to be sorted
   indx is array of ints initially 0...n-1 */

int fpt_heapsort( n, arrin, indx )
int n;
fpt *arrin;
int *indx;
{
    int l, ir, indxt, i, j;
    fpt q;

    l = n/2;
    ir = n - 1;

    while (TRUE) /* the "10" loop */
    {
        if (l > 0) { indxt = indx[--l]; q = arrin[indxt]; }
        else {
            indxt = indx[ir]; q = arrin[indxt];
            indx[ir--] = indx[0];
            if ( ir == 0 )
                { indx[0] = indxt; return 1; } /* only way out! */
        }
        i = l;
        j = l + l + 1;

        while (j <= ir) /* the "20" loop */
        {
            if (j < ir) if (arrin[indx[j]] < arrin[indx[j+1]]) j++;
            if (q < arrin[indx[j]]) { indx[i] = indx[j]; i = j; j += j + 1; }
            else j = ir + 1;
        }
        indx[i] = indxt;
    }
}
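
An index sort of this kind leaves the key array untouched and permutes only the index array. The sketch below shows the same technique in a more conventional sift-down form; it is an independent illustration rather than a transcription of fpt_heapsort, and the names `sift_down` and `index_heapsort` are hypothetical.

```c
#include <assert.h>

/* Restores the max-heap property below position i, comparing keys[indx[..]]. */
static void sift_down(const double *keys, int *indx, int i, int n)
{
    int top = indx[i];
    int child;

    while ((child = 2 * i + 1) < n) {
        if (child + 1 < n && keys[indx[child]] < keys[indx[child + 1]])
            child++;                       /* pick the larger child */
        if (keys[top] >= keys[indx[child]])
            break;
        indx[i] = indx[child];
        i = child;
    }
    indx[i] = top;
}

/* Sorts indx so that keys[indx[0]] <= keys[indx[1]] <= ...; keys is untouched.
   indx must hold 0..n-1 (in any order) on entry. */
static void index_heapsort(int n, const double *keys, int *indx)
{
    int i, tmp;

    assert(n >= 0);
    for (i = n / 2 - 1; i >= 0; i--)       /* build the heap */
        sift_down(keys, indx, i, n);
    for (i = n - 1; i > 0; i--) {          /* move the current maximum to the end */
        tmp = indx[0]; indx[0] = indx[i]; indx[i] = tmp;
        sift_down(keys, indx, 0, i);
    }
}
```

Sorting by index is what lets the LRT code walk the property distances in ascending order while still looking up the matching biological difference at the same position.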
the C CMS functions shown in Sections I and II. */

/*                                                                        */
/*           Molecule and Supporting Structure Definitions                */
/*                                                                        */
/*           John McAlister                 09-Aug-1985                   */
/*                                                                        */
/*   This file contains the definitions for the molecular data            */
/*   structures required within SYBYL. The contents of this file are      */
/*   described in detail in the document "SYBYL Molecular Data            */
/*   Structures".                                                         */
/*                                                                        */

/* Define the molecule descriptor template */
typedef struct molecule_struct {
    char     *name;         /* pointer to molecule name */
    I32      type;          /* molecule type */
    List_Ptr dict;          /* list of dictionaries used with molecule */
    I32      status;        /* molecule status */
    char     *comment;      /* pointer to comment for molecule */
    stamp    cre_time;      /* creation time/user/version stamp */
    stamp    mod_time;      /* modification time/user/version stamp */
    int      max_props;     /* maximum properties currently allocated */
    int      nprops;        /* number of molecular properties */
    List_Ptr props;         /* pointer to list of properties */
    int      max_feats;     /* maximum features currently allocated */
    int      nfeats;        /* number of molecular features */
    List_Ptr feats;         /* pointer to list of molecular features */
    int      max_subst;     /* maximum substructures currently allocated */
    int      nsubst;        /* number of substructures in molecule */
    List_Ptr subst;         /* pointer to list of substructures */
    List_Ptr subst_roots;   /* pointer to list of root subst offsets */
    int      max_atoms;     /* maximum atoms currently allocated */
    int      natoms;        /* number of atoms in molecule */
    List_Ptr atoms;         /* pointer to atom array segment list */
    int      max_bonds;     /* maximum bonds currently allocated */
    int      nbonds;        /* number of bonds in molecule */
    List_Ptr bonds;         /* pointer to bond array segment list */
    int      charges;       /* type of atomic charges, if present */
    fpt      vector[3];     /* translation vector for molecule */
    fpt      matrix[9];     /* rotation matrix for molecule */
    List_Ptr assoc_data;    /* pointer to list of associated data descriptors */
} molecule, *mol_ptr;



/*                          ATOM DEFINITION                          */

/* Define the atom entry record template */
typedef struct atom_struct {
    char     *name;       /* atom name */
    int      type;        /* atom type */
    I32      status;      /* atom status */
    int      recno;       /* cumulative atom record number */
    int      id;          /* atom id (logical atom number) */
    int      link;        /* link to next atom record */
    int      subst;       /* offset to substructure containing atom */
    List_Ptr property;    /* pointer to list of properties for atom */
    List_Ptr feature;     /* pointer to list of features including */
                          /* this atom */
    int      nbond;       /* number of bonds involving this atom */
    List_Ptr conn_atom;   /* pointer to list of bonded atoms */
    fpt      xyz[3];      /* coordinates of atom */
    fpt      charge;      /* point charge on atom */
} atom, *atom_ptr;

/* Define the atom array segment descriptor template */
typedef struct atom_seg_struct {
    atom_ptr seg_head;    /* pointer to head of atom array segment */
    mol_ptr  molecule;    /* pointer to molecule containing atom seg */
    int      max_atom;    /* maximum number of atom records in seg */
    int      natom;       /* number of filled atom records in seg */
    int      used_atom;   /* offset to first filled record in segment */
    int      free_atom;   /* offset to first free record in segment */
} atom_seg, *aseg_ptr;

/* Define the bond specifier records pointed to by the atom records */
typedef struct atom_conn_struct {
    int target;           /* offset to target atom */
    int bond_rec;         /* offset to bond descriptor record */
} atom_conn, *acon_ptr;

/*                          BOND DEFINITION                          */

/* Define the bond entry record template */
typedef struct bond_struct {
    int      type;        /* bond type */
    I32      status;      /* bond status */
    int      recno;       /* cumulative bond record number */
    int      id;          /* bond id (logical bond number) */
    int      link;        /* link to empty bond record */
    List_Ptr property;    /* pointer to bond property list */
    List_Ptr feature;     /* pointer to list of features including */
                          /* this bond */
    int      o_subst;     /* offset to origin atom substructure */
    int      origin;      /* offset to atom at bond origin */
    int      t_subst;     /* offset to target atom substructure */
    int      target;      /* offset to atom at bond destination */
} bond, *bond_ptr;

/* Define the bond array segment descriptor template */
typedef struct bond_seg_struct {
    bond_ptr seg_head;    /* pointer to head of bond array segment */
    mol_ptr  molecule;    /* pointer to molecule containing bond seg */
    int      max_bond;    /* maximum number of bonds in segment */
    int      nbond;       /* number of filled bond records in seg */
    int      used_bond;   /* offset to first filled record in segment */
    int      free_bond;   /* offset to first free record in segment */
} bond_seg, *bseg_ptr;

/* comfa.h */

/* Regions are the set of points at which energy evaluations are made */
/* in the Cc^A method of ncao n ^„ ^- ^.^^Jz^^ ",^^^ 



♦/ 
V 



/* at a °5 ^ ^^^^^^ " <l«"^^d as the union 

/* °^mirf ^'^^^*=^ ^« ^ si'^S^e P«=i^t in the */ 

/* c;Sa ™™« ^^"^ associated attributes. Attributes needed for */ 

CoMFA purposes are outlined below. 

/ */ 

#ifndef QSAR_COMFA_DEFINITIONS

#define QSAR_COMFA_DEFINITIONS 1

#include "ta_types.h"

#define LP 20 /* lone pair atom id */

typedef enum {
    FDENGY_UNKNOWN,
    FDENGY_ELECT,
    FDENGY_STERIC,
    FDENGY_HOMO,
    FDENGY_LUMO,
    DOCK_ELECT,
    DOCK_STA_NOHB,
    DOCK_STA_HBD,
    DOCK_STA_HBA,
    DOCK_STB_NOHB,
    DOCK_STB_HBD,
    DOCK_STB_HBA } FldEngyTyp;

typedef enum {
    FDHD_ORIGINAL,
    FDHD_FFIT,
    FDHD_XTERN,
    FDHD_FUNC,
    FDHD_USER,
    FDHD_USR_AVG,
    FDHD_DOCK,
    FDHD_AVG,
    FDHD_SIG,
    FDHD_MAX,
    FDHD_MIN,
    FDHD_COEFF,
    FDHD_AVG_X,
    FDHD_SIG_X,
    FDHD_FLD_X,
    FDHD_RANGE,
    FDHD_PLS_XWT,
    FDHD_PLS_XLOAD,
    FDHD_FAC_LOAD,
    FDHD_FAC_COMM,
    FDHD_FAC_ROTLOAD,
    FDHD_SIMCA_LOAD,
    FDHD_SIMCA_MODEL,
    FDHD_SIMCA_DISCRIM,
    FDHD_HBD } FldHowTyp;

typedef struct {
    fpt lo[3],         /* corner with lowest values for each axis */
        hi[3],         /*   "     "  highest   "     "    "    "   */
        stepsize[3];   /* increment between points */
    int nstep[3],      /* derived as 1 + (hi-lo + epsilon) / stepsize */
        n;             /* n = product of nstep[i] */
    int atom_type;     /* SYBYL atom type, for steric energy computation */
    fpt pt_charge;     /* elemental charge at point, for electrostatics */
    fpt *weight;       /* weight [n] is applied in all computations, e.g. */
    int avg_type;      /* box of 'scale', sphere, sphere x vdw, ...? */
    fpt avg_scale;     /* scale whose meaning derived from avg_type */
    int arb,           /* arbitrary int for later use */
        *parb;         /*     "     pointer "    "    "  */
} Box, *BoxPtr;
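
The comments on nstep and n give the lattice size as a formula, which can be checked with a short sketch. `GridBox`, `gridbox_count`, and the value of `EPSILON` are assumptions made for the illustration; only the formula 1 + (hi - lo + epsilon) / stepsize and the product rule for n come from the comments above.

```c
#include <assert.h>

/* Hypothetical, reduced version of the Box record: geometry only. */
typedef struct {
    double lo[3], hi[3], stepsize[3];
    int nstep[3];
    int n;
} GridBox;

/* Assumed tolerance that keeps (hi - lo) / stepsize from truncating one short. */
#define EPSILON 1.0e-4

/* Fills nstep[] and n from lo, hi and stepsize. */
static void gridbox_count(GridBox *box)
{
    int i;

    box->n = 1;
    for (i = 0; i < 3; i++) {
        assert(box->stepsize[i] > 0.0 && box->hi[i] >= box->lo[i]);
        box->nstep[i] = 1 + (int)((box->hi[i] - box->lo[i] + EPSILON)
                                  / box->stepsize[i]);
        box->n *= box->nstep[i];
    }
}
```

A box running from (0, 0, 0) to (4, 2, 0) with a 2 Angstrom step on every axis has 3 x 2 x 1 = 6 lattice points, so a field evaluated over it stores six values per field type.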



typedef struct {
    char   *filename;    /* name of the region's file (if any) */
    int    n_boxes;      /* number of boxes which make up the region */
    int    n_points;     /* number of points in this region altogether */
    BoxPtr box_array;    /* box_array[n_regions], each one a Box */
    int    n_refs;       /* number of CURRENT references to this memory */
    long   when_made;    /* creation stamp */
} Region, *RegionPtr;


typedef struct {
    char       *reg_name;      /* name of the region's file (if any) */
    char       *fld_name;      /* name of this field's file (if any) */
    RegionPtr  reference;      /* the region referenced by this field */
    FldEngyTyp fld;            /* what type of field is referenced here */
    int        num_avgd;       /* number of fields averaged into this one */
    int        curr_iter;      /* number of iterations in current field fit run */
    char       *mol_id;        /* unspecified molecule id,
                                  e.g. dbname/molname/alignname */
    int        n_points;       /* number of points in associated region */
    int        zap_el;         /* whether electrostatics are MISSING when>max_st */
    fpt        max_value;      /* largest permitted absolute value of energy */
    fpt        *field_value;   /* values at each point of the field */
    int        n_refs;         /* number of CURRENT references to this memory */
    long       when_made;      /* creation stamp */
    /* added these 4 items 1/30/89 DEP */
    int        vol_avg_type;
    fpt        scale_vol_avg;
    int        dielectric;
    int        repulsive;
    FldHowTyp  how_made;       /* perry's way = 1 or old way = 0 */
} Field, *FieldPtr;

/* Information solicited by QSAR table operations,
   passed into COMFA column field evaluations */

typedef struct { 

ctoi^^" ^i^^nLo"^"^'' '/* Whether a field name exists (otherwise alianment) 

*™T^' ' * °^ alignment; Nil align«.use as is^f 

Z^l *steric_name; /* name of steric field (if applicable)' 

SdPtr iJfd^S^^ electrostatic field !if aSlicSie 

Fieldftr ISJ-S " steric field in memory (when there) 



V 
V 
V 

*/ 
*/ 

V 



/* molecule-independent information for CoMFA evaluations */

typedef struct {
    int       vol_avg;        /* case for volume averaging: 0,1,2=none,box,sphere (0) */
    fpt       vol_scale;      /* scale for volume averaging (1.6) */
    int       fld_types;      /* case for what fields: 0,1,2=both,steric,elect. (0) */
    fpt       steric_max;     /* maximum steric energy (30) */
    int       repulsive;      /* steric repulsive exponent - 12, 10, or 8 (12) */
    fpt       elect_max;      /* maximum electrostatic energy (30) */
    int       dielectric;     /* case for dielectric (AS FORCE FIELD TAILOR) */
    int       elect_out;      /* case to drop elect inside steric max: 0,1=T,F (1) */
    char      *region_name;   /* name of region used in the CoMFA computations */
    FieldPtr  sweight_fld;    /* points to MEMORY field for weighting steric PLS */
    FieldPtr  eweight_fld;    /* points to MEMORY field for weighting elect PLS */
    FldHowTyp how_done;       /* perry's way = 1 or old way = 0 */
    int       du_lp_steric;   /* include dummies and lone pairs in steric field
                                 calculations */
    int       du_lp_elect;    /* include dummies and lone pairs in electrostatic
                                 field calculations */
    int       spare1;         /* TAILOR!COMFA!TRANSFORM */
    int       spare2;         /* INDICATOR SCALE among other things */
} ComfaTop, *ComfaTopPtr;

#endif

Section III-B. Functional descriptions of external procedures.
(Routines that simply return dynamic memory to the heap are not 
described.) 

BOND_V_ERING - TRUE if bond is in an external ring.

BOND_V_IRING - TRUE if bond is in an internal (simple) ring.

QSAR_FIELD_EVAL_GETOFF - provides coordinates for field
computation when "volume averaging" is being done.

QSAR_FIELD_VDWTAB - returns steric parameters for the
computation of the field contribution from the probe atom and each
of the molecule atoms.

SYB_AREA_GET_MOLECULE - returns the internal representation of
the molecule in some area or "container", if such exists.

SYB_ATAB_ATOMIC_NUMBER - returns the atomic number of the
specified atom type.

SYB_ATAB_ATOMIC_WEIGHT - returns the atomic weight of the
specified atom type.

SYB_ATAB_HBOND_ACCEPT - returns TRUE if the specified atomic
type is a hydrogen-bond accepting atom.

SYB_ATAB_VDW_RADII - returns the atomic radius of the specified
atomic type.

SYB_ATOM_FIND_ID - returns the internal representation of an atom
referenced by its atom ID number. (Atom IDs are guaranteed to be
continuous but the ID of any single atom may change as atoms are
added or deleted.)

SYB_ATOM_FIND_REC - returns the internal representation of an 
atom referenced by its record ID number. (Atom record IDs are 
invariant - but there may be "holes" in their sequence such that the 
largest record ID may be greater than the number of atoms.) 

SYB_ATOM_FIND_SET - returns the bitset of atoms corresponding to 
a list of atoms. 

SYB_BOND_FIND_REC - returns the internal representation of a bond
referenced by its (invariant) record ID number.

SYB_BTAB_MNEM_TO_TYPE - converts an ASCII representation of a
bond type to its internal representation.

SYB_EXPR_ANALYZE - parses a user-entered ASCII description of
atoms (e.g., M2(<H>) for all hydrogen atoms within molecule M2)
into internally valid representations of molecule and atoms.

SYB_HBOND_DONORS - returns the set of IDs for atoms which are
hydrogen-bonding hydrogens.

TAILOR_STORE_IT_HERE - returns the current value of a user- (and
SPL-) accessible variable.

TBL_ACCESS_INDEX_TO_COLNAME - converts a user-provided MSS
column ID to a column name (name is guaranteed to be a unique
identifier).

TBL_GRAB_COMPLETE_FPTS - done returning multiple (scalar) values
in an MSS column to an array.

TBL_GRAB_GET_FPTS_INV - in a multiple value retrieval, returns the
value corresponding to a user-provided row ID.

TBL_GRAB_INIT_FPTS - set up for returning multiple (scalar) values
in an MSS column to an array.

UBS_OUTPUT_MESSAGE - equivalent to fprintf().

UIMS2_VAR_GET_TOKEN - returns the current value of a global SPL
variable.

UIMS2_WRITE_ERROR - writes text to the error output stream.

UTL_FILE_FCLOSE, UTL_FILE_FOPEN - equivalent to fclose() and
fopen().

UTL_LIST_RETRIEVE - returns the next element on a linked list.

UTL_MEM_ALLOC - equivalent to malloc().

UTL_SET_AND_INPLACE - makes the first set logically equivalent to
the second set, with only those bits that are also 1 in the third set
becoming 1 in the first set.

UTL_SET_CARDINALITY - returns the number of bits that are 1 in a
particular bitset.

UTL_SET_CLEAR - sets all bits in the set to 0.

UTL_SET_COPY_INPLACE - makes the first set logically identical to
the second.

UTL_SET_CREATE - creates and returns an empty set of requested
size.

UTL_SET_DELETE - sets the specified bit to 0.

UTL_SET_DIFF_INPLACE - makes the first set logically equivalent to
the second set, with all bits that are 1 in the third set becoming 0 in
the first set.

UTL_SET_EMPTY - TRUE if all bits in the set are 0.

UTL_SET_INSERT - sets the requested bit to 1.

UTL_SET_MEMBER - returns TRUE if the requested set bit equals 1.

UTL_SET_NEXT - returns the identity of the next non-zero bit in a
set.

UTL_SET_OR_INPLACE - makes the first set logically equivalent to
the second set, with all bits that are 1 in the third set becoming 1 in
the first set.

UTL_STR_CMP_NOCASE - case-insensitive version of strcmp().
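
The UTL_SET_* descriptions above amount to a conventional fixed-size bitset. The following is a minimal sketch in C of how such routines could behave, not the SYBYL implementation: the bs_* names, the word type, and the storage layout are all hypothetical. The layout (size stored in the first element, bits after it) is assumed from the note in the PHORE_LOC listing that sets begin after a first element holding the set size.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical layout: s[0] holds the set size in bits; bit words follow. */
typedef unsigned int bs_word;
#define BS_WBITS ((unsigned)(sizeof(bs_word) * CHAR_BIT))
#define BS_NWORDS(nbits) (((nbits) + BS_WBITS - 1) / BS_WBITS)

/* cf. UTL_SET_CREATE: returns an empty set of the requested size, or NULL. */
bs_word *bs_create(unsigned nbits)
{
    bs_word *s = calloc(1 + BS_NWORDS(nbits), sizeof *s);
    if (s) s[0] = nbits;
    return s;
}

/* cf. UTL_SET_INSERT: sets the requested bit to 1. */
void bs_insert(bs_word *s, unsigned bit)
{
    assert(bit < s[0]);
    s[1 + bit / BS_WBITS] |= (bs_word)1 << (bit % BS_WBITS);
}

/* cf. UTL_SET_MEMBER: 1 if the requested bit is set, else 0. */
int bs_member(const bs_word *s, unsigned bit)
{
    assert(bit < s[0]);
    return (int)((s[1 + bit / BS_WBITS] >> (bit % BS_WBITS)) & 1u);
}

/* cf. UTL_SET_CARDINALITY: number of bits that are 1. */
unsigned bs_cardinality(const bs_word *s)
{
    unsigned n = 0, i;
    for (i = 0; i < s[0]; i++) n += (unsigned)bs_member(s, i);
    return n;
}

/* cf. UTL_SET_NEXT: next set bit after prev (pass -1 to start); -1 when done. */
int bs_next(const bs_word *s, int prev)
{
    unsigned i;
    for (i = (unsigned)(prev + 1); i < s[0]; i++)
        if (bs_member(s, i)) return (int)i;
    return -1;
}

/* cf. UTL_SET_AND_INPLACE, UTL_SET_OR_INPLACE, UTL_SET_DIFF_INPLACE:
   dst receives (a AND b), (a OR b), or (a AND NOT b), word by word. */
void bs_and_inplace(bs_word *dst, const bs_word *a, const bs_word *b)
{
    unsigned i;
    assert(dst[0] == a[0] && a[0] == b[0]);
    for (i = 1; i <= BS_NWORDS(dst[0]); i++) dst[i] = a[i] & b[i];
}

void bs_or_inplace(bs_word *dst, const bs_word *a, const bs_word *b)
{
    unsigned i;
    assert(dst[0] == a[0] && a[0] == b[0]);
    for (i = 1; i <= BS_NWORDS(dst[0]); i++) dst[i] = a[i] | b[i];
}

void bs_diff_inplace(bs_word *dst, const bs_word *a, const bs_word *b)
{
    unsigned i;
    assert(dst[0] == a[0] && a[0] == b[0]);
    for (i = 1; i <= BS_NWORDS(dst[0]); i++) dst[i] = a[i] & ~b[i];
}
```

The iteration idiom used in the PHORE_LOC listing (start at -1, call the "next" routine until it reports no more elements) maps onto bs_next() in this sketch; the sentinel value -1 is an assumption standing in for NO_MORE_ELEM.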

APPENDIX "B"

/* CODE. This code implements a PHORE_LOC column type and calculates a single
   cell value (the Hydrogen Bonding Fingerprint for a molecule) within the SYBYL
   Molecular Spreadsheet. It is to be understood that other supporting code
   handles user input, user output, and disk file I/O. */

/* data structure for PHORE_LOC column type */
typedef struct PHORE {
    char *disco_fn;     /* user name for DISCO feature file - default appears below */
    int disco_in;       /* internal flag if DISCO feature file loaded */
    char *region_fn;    /* user name for defining region file */
    RegionPtr rgn;      /* internal reference to region when loaded */
    int nfuzz;          /* number of extra lattice points (each direction)
                           for each PHORE feature */
    int nbits;          /* set length (must agree with rgn contents or EVAL fails) */
} PHORE, *PPHORE;
/*+E:QSAR_PROC_EVAL_PHORE_LOC*/
/**********************************************************************/
/* int QSAR_PROC_EVAL_PHORE_LOC(tablename, row, colname)              */
/*                                                                    */
/* Dick Cramer 31-Jul-95 (PHORE_LOC lattice bitset)                   */
/*                                                                    */
/* This module generates bitsets whose cardinality is equal to        */
/* lattice points x 2 (# of sitepoint classes). For each              */
/* instance of a pharmacophoric point in the molecule being           */
/* processed, the geometrically nearest (1+m)**3 bits in the          */
/* bitset will be set to 1 (where m is user supplied).                */
/*                                                                    */
/* NOTE: this routine explicitly requires that sets begin after a     */
/* first element that is the set size!!!                              */
/*                                                                    */
/* Inputs                                                             */
/*                                                                    */
/* Outputs                                                            */
/*                                                                    */
/* User Required Definition Files                                     */
/*                                                                    */
/**********************************************************************/
/*-E*/

int QSAR_PROC_EVAL_PHORE_LOC(tablename, row, colname)
char *tablename, *colname;
int row;
{
    mol_ptr mol;
    PPHORE phr;
    int err, status, nvalid, mol_area;
    char *dum;
    set_ptr print, qsar_proc_calc_phore_set();
    FILE *fp;

    /* get the molecule */
    if (!TBL_UTL_GET_MOLECULE(tablename, row, FALSE, &mol))
    {
        if (UTL_ERROR_IS_SET()) {err=1; goto error;}
        else return FALSE;
    }

    /* get the user-provided input data */
    if (!TBL_ATTR_FIND_COLUMN_A(tablename, colname, "PROC_SUPPORT", &dum,
                                (int *)&phr)) {err=3; goto error;}

    /* retrieve DISCO stuff if not yet present */
    if (!phr->disco_in) {
        if (!phr->disco_fn) {err=1; goto error;}
        /* set appropriate tailor value, then initialize DISCO */
        sprintf(str, "SETVAR TAILOR!DISCO!FILE %s", phr->disco_fn);
        UIMS2_EXEC_COMMAND(str);
        UIMS2_EXEC_COMMAND("DISCO INIT");
        phr->disco_in = TRUE;
    }

    /* retrieve region if not yet present */
    if (!phr->rgn) {
        if (!phr->region_fn) {err=1; goto error;}
        if (!(phr->rgn = QSAR_REGION_RETRIEVE(phr->region_fn)))
            {err=4; goto error;}
        if (phr->rgn->n_boxes > 1) {
            sprintf(str, "WARNING: Region %s has %d boxes. Only first will be used.\n",
                    phr->region_fn, phr->rgn->n_boxes);
            UBS_OUTPUT_MESSAGE(stdout, str);
        }
        phr->nbits = 2 * phr->rgn->n_points;
    }

    /* evaluate this result, first the DISCO call */
    if (!(print = qsar_proc_calc_phore_set(mol, phr, &nvalid)))
        {err=12; goto error;}

    /* go store both the bitset in the MSS "Cell_Support" and the number of bits
       actually set in the "CELL", so there's something for the user to see */
    if (!TBL_ACCESS_X_PUT_VALUE(tablename, row, colname, "CELL_SUPPORT",
                                (int *)&print)) {err=11; goto error;}
    if (!TBL_ACCESS_X_PUT_VALUE(tablename, row, colname, "CELL",
                                (int *)&nvalid)) {err=11; goto error;}
    return TRUE;

error:
    sprintf(str, "QSAR_PROC_EVAL_PHORE_LOC(%d)", err);
    UTL_ERROR_ADD_TRACE(str);
    return FALSE;
}

set_ptr qsar_proc_calc_phore_set(mol, phr, nvalid)
/* creates actual bitset */
mol_ptr mol;
PPHORE phr;
int *nvalid;
{
    set_ptr anset = NIL, pset = NIL, SYB_FEAT_FIND_ID_SET();
    feat_ptr featp, SYB_FEAT_FIND_REC();
    atom_ptr a, SYB_ATOM_FIND_REC();
    int err, elem, sitebase, ci, xybase, boff, lt_base[3], lt_off[3],
        loff = 0, hioff = 0;
    fpt tmp;
    BoxPtr bxptr;
    line_ptr cdp;

    if (!(anset = UTL_SET_CREATE(phr->nbits))) {err = 1; goto error;}
    *nvalid = 0;
    if (phr->nfuzz) {
        loff -= phr->nfuzz / 2;
        hioff += (phr->nfuzz + 1) / 2;
    }
    bxptr = phr->rgn->box_array;
    xybase = bxptr->nstep[0] * bxptr->nstep[1];

    /* generate the DISCO sites for this molecule, which */
    UIMS2_EXEC_COMMAND("ECHO %DISCO_SITES()");
    /* ..become "FEATURES" + "dummy atoms" within SYBYL's molecule data
       structure */
    pset = SYB_FEAT_FIND_ID_SET(mol, FEAT_V_LINE, 1, mol->nfeats);
    if (pset) {
        elem = -1;
        while ((elem = UTL_SET_NEXT(pset, elem)) != NO_MORE_ELEM) {
            if (!(featp = SYB_FEAT_FIND_REC(mol, elem))) goto error;
            if ((featp->name[1] == 'S') && (featp->name[2] == '_')) {
                /* have an H-bonding feature, it must represent a line */
                sitebase = featp->name[0] == 'A' ? 0 : phr->rgn->n_points;
                /* the dummy atom at the end of the line is our H-bonding locus */
                cdp = (line_ptr) featp->dataptr;
                if (!(a = SYB_ATOM_FIND_REC(mol, cdp->positn))) {err=2; goto error;}
                for (ci = 0; ci < 3; ci++) {
                    tmp = (a->xyz[ci] - bxptr->lo[ci]) / bxptr->stepsize[ci];
                    lt_base[ci] = (int) (tmp < 0.0 ? tmp - bxptr->stepsize[ci] : tmp);
                }
                /* cycle through all points touched by this locus that are also
                   within the region */
                for (lt_off[0] = lt_base[0] + loff; lt_off[0] <= lt_base[0] + hioff;
                     lt_off[0]++)
                 if (lt_off[0] >= 0 && lt_off[0] < bxptr->nstep[0])
                  for (lt_off[1] = lt_base[1] + loff; lt_off[1] <= lt_base[1] + hioff;
                       lt_off[1]++)
                   if (lt_off[1] >= 0 && lt_off[1] < bxptr->nstep[1])
                    for (lt_off[2] = lt_base[2] + loff; lt_off[2] <= lt_base[2] + hioff;
                         lt_off[2]++)
                     if (lt_off[2] >= 0 && lt_off[2] < bxptr->nstep[2])
                     {
                         boff = xybase * lt_off[2] +
                                (bxptr->nstep[0]) * lt_off[1] +
                                lt_off[0] + sitebase;
                         UTL_SET_INSERT(anset, boff);
                         (*nvalid)++;
                     }
            }
        }
        UTL_SET_DESTROY(pset);
    } /* pset exists */
    return (anset);

error:
    sprintf(str, "qsar_proc_calc_phore_set(%d)", err);
    UTL_ERROR_ADD_TRACE(str);
    return FALSE;
}
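
The lattice arithmetic in qsar_proc_calc_phore_set can be looked at in isolation. The following is a minimal sketch under stated assumptions, not the listing's code: the Lattice struct is a hypothetical stand-in for the first box of the region, n_points is assumed to equal the product of the three nstep values, and the base index is computed with an ordinary floor rather than the listing's `tmp - stepsize` adjustment for coordinates below the box origin.

```c
#include <assert.h>

/* Hypothetical stand-in for the first box of a region. */
typedef struct {
    double lo[3];        /* lower corner of the box */
    double stepsize[3];  /* lattice spacing along each axis */
    int nstep[3];        /* number of lattice points along each axis */
} Lattice;

/* Bit offset of lattice point idx[] for site class 0 or 1 (the listing
   uses offset 0 for names starting with 'A' and n_points otherwise).
   Returns -1 when the point lies outside the box. */
int lattice_bit(const Lattice *lt, const int idx[3], int site_class)
{
    int c, n_points = lt->nstep[0] * lt->nstep[1] * lt->nstep[2];
    assert(site_class == 0 || site_class == 1);
    for (c = 0; c < 3; c++)
        if (idx[c] < 0 || idx[c] >= lt->nstep[c]) return -1;
    return lt->nstep[0] * lt->nstep[1] * idx[2]
         + lt->nstep[0] * idx[1]
         + idx[0]
         + site_class * n_points;
}

/* Collects the bit offsets touched by one locus at xyz[], widened by nfuzz
   extra lattice points per axis. Stores at most max_out offsets in out[]
   and returns how many were stored. */
int locus_bits(const Lattice *lt, const double xyz[3], int nfuzz,
               int site_class, int *out, int max_out)
{
    int base[3], idx[3], c, n = 0, bit;
    int lo_off = -(nfuzz / 2), hi_off = (nfuzz + 1) / 2;

    for (c = 0; c < 3; c++) {
        double t = (xyz[c] - lt->lo[c]) / lt->stepsize[c];
        base[c] = (int)t;
        if (t < base[c]) base[c]--;   /* floor for negative values */
    }
    for (idx[2] = base[2] + lo_off; idx[2] <= base[2] + hi_off; idx[2]++)
        for (idx[1] = base[1] + lo_off; idx[1] <= base[1] + hi_off; idx[1]++)
            for (idx[0] = base[0] + lo_off; idx[0] <= base[0] + hi_off; idx[0]++) {
                bit = lattice_bit(lt, idx, site_class);
                if (bit >= 0 && n < max_out) out[n++] = bit;
            }
    return n;
}
```

With nfuzz = m, an interior locus touches (1+m)**3 lattice points, which matches the count stated in the module header; points falling outside the box are simply skipped, as in the listing.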

# This file determines the recognition of site points in Sybyl/DISCO.
# See the SYBYL DISCO manual for detailed documentation. The defined types are
# (1) HB  : the QUERY is searched in the SEARCH mode, and all occurrences
#           are assigned DISCO features according to the remaining
#           specifications -- the three ATOMS refer to the atom number
#           in QUERY such that the feature is DIST from the first atom
#           at bond ANGLE with the first and second atom at each of the
#           TORSIONS formed by the site point and the three ATOMS in order.
#           A sitepoint of NAME is added at these extension points,
#           -- and -- the first atom is assigned a feature complementary
#           to the extension point (such as HBD_CO_ and RHBD_CO_).
# (2) HBex: differs from HB in that the angles and torsions are replaced
#           by two other arguments: whether lone pairs are part of the
#           extension point placement, and which ATYPE (generally LP
#           and/or H) determine the direction of the sitepoints.
#

#TYPE NAME ATOMS SEARCH DIST ANGLE TORSIONS QUERY
HB DS_O2C2_ 4 2 1 NoDup 2.9 120 "0.0 180.0" HevC(Any)=O[f]
HB DS_O3Car_ 1 3 4 All 2.9 119 "0.0 180.0" O[f]HC(:Hev):Hev
HB DS_O3Car_ 1 2 3 All 2.9 119 "0.0 180.0" O[f]C(:Hev):Hev
HB DS_O3Car_ 1 3 4 NoDup 2.9 119 "0.0 180.0" O[f]HC(=O)
HB DS_O3Car_ 1 2 3 NoDup 2.9 119 "0.0 180.0" O[f]C(=O)
HB DS_O3Car_ 2 1 3 All 2.9 120 "0.0 180.0" C(:O[f]):O[f]
HB DS_O3C3_ 1 3 6 NoDup 2.9 117 "60 180 300" O[f]HC(Any)(Any)C(Any)(Any)Any
HB DS_N3C3_ 1 4 5 NoDup 2.9 110 "60 180 300" N[f]H2ZC{Z:C&!C=O&!C:Hev}
HB DS_O2S_ 3 2 1 All 2.9 120 "0.0 180" AnyS(=O)(=O)NH
#TYPE NAME ATOMS SEARCH DIST LP ATYPE Query
HBex DS_O3C3_ 2 1 3 NoDup 2.9 YES "LP H" O[f]HC(Any)(Any)Z{Z:Hev&!C(Any)(Any)Any}
HBex DS_O3C3_ 3 1 2 NoDup 2.9 YES "LP" O[f](Z)Z{Z:C&!C=Het}
HBex DS_N3C3_ 2 1 4 Nodup 2.9 "" "H" N[f]H2YaZ{Z:Hev&!C}{Ya:C&!C=O&!C:Hev}
HBex DS_N3C3_ 2 1 3 NoDup 2.9 YES "LP H" N[f]H(Ya)Ya{Ya:C&!C=O&!C:Hev}
HBex DS_N3C3_ 3 1 2 NoDup 2.9 YES "LP" N[f](Ya)(Ya)Ya{Ya:C&!C=O&!C:Hev}
HBex DS_N2C2_ 2 1 3 NoDup 3.0 YES "H LP" N[f]H=C
HBex DS_N2C2_ 1 2 3 NoDup 3.0 YES "H LP" Any-N[f]=C
HBex DS_N2C2_ 1 2 3 NoDup 3.0 YES "LP" Any-N[r]=C[r]
HBex DS_N2N2_ 2 1 3 NoTriv 3.0 YES "LP H" N[1]H:C:C:N[f]:C:@1
HBex DS_N2N2_ 2 1 3 NoTriv 3.0 YES "LP H" N[1]H:C:C:N[f]:C:@1
HBex DS_N2N2_ 3 2 1 NoDup 3.0 YES "LP" C:N[f]:Hev
hb DS_O3S_ 3 2 1 NoDup 2.9 128 "0.0 180.0" HevS=O[f]
hb DS_O3S_ 4 2 1 All 2.9 128 "0.0 180.0" HevS(-O[f])=O[f]
hb DS_O3S_ 4 2 1 All 2.9 128 "0.0 180.0" HevS(-O[f])(-O[f])-O[f]
hb DS_O3N_ 3 2 4 All 2.9 128 "0.0 180.0" HevN(O[f])O[f]
hb DS_O2N_ 4 2 1 NoDup 2.9 128 "0.0 180.0" HevN(Hev)-O[f]
hbex DS_N2N2_ 3 2 1 NoDup 3.0 YES "LP" N:N[f]:N
hb DS_O3P_ 3 1 2 All 2.9 128 "0.0 180.0" P(-O)(-O)(-O)(-O)
hb DS_O3P_ 3 1 2 All 2.9 128 "0.0 180.0" P(-O)(-O)(-O)
# #CLASSNAMES# Acceptorsite Donor_Atom DL
HB AS_HO3C2_ 1 3 4 All 2.9 119 "0.0 180.0" O[f]HC(:Hev):Hev
HB AS_HO3C3_ 1 3 6 NoDup 2.9 117 "60 180 300" O[f]HC(Any)(Any)C(Any)(Any)Any
HB AS_N3C3_ 1 4 7 NoDup 2.9 110 "60 180 300" N[f]H2C(Any)(Any)C(Any)(Any)Any
HB AS_N3C3_ 1 5 8 NoDup 2.9 110 "60 180 300" N[f]H3C(Any)(Any)C(Any)(Any)Any
#TYPE NAME ATOMS SEARCH DIST LP ATYPE Query
HBex AS_HN2C2_ 2 1 3 NoDup 3.0 "" "H" NHC(Any)=O[f]
HBex AS_HN2C2_ 3 2 1 NoDup 3.0 YES "LP H" C:N[f]H:Hev
HBex AS_HN2C2_ 6 5 4 NoTriv 3.0 YES "LP" N[1]H:C:C:N[f]:C:@1
HBex AS_HO3C3_ 2 1 3 NoDup 2.9 YES "LP H" O[f]HC(Any)(Any)Z{Z:Hev&!C(Any)(Any)Any}
HBex AS_HN2C2_ 3 2 4 Nodup 3.0 YES "LP H" HevN[f]H=C
HBex AS_HN2C2_ 1 2 3 Nodup 3.0 YES "LP" HevN[f]=C
HBex AS_HN2C2_ 2 1 4 Nodup 3.0 "" "H" N[f]H2C(N)=N
HBex AS_N3C3_ 2 1 4 Nodup 2.9 YES "LP H" N[f]H2C(Any)(Any)Z{Z:Hev&!C(Any)(Any)Any}
HBex AS_N3C3_ 2 1 5 Nodup 2.9 YES "LP H" N[f]H3C(Any)(Any)Z{Z:Hev&!C(Any)(Any)Any}
HBex AS_N3C3_ 2 1 3 NoDup 2.9 YES "LP H" N[f]H(Ya)Ya{Ya:C&!C=O&!C:Hev}
HBex AS_N3C3_ 2 1 4 NoDup 2.9 YES "LP H" N[f]H2(Ya)Ya{Ya:C&!C=O&!C:Hev}
HBex AS_N3C3_ 2 1 3 NoDup 2.9 YES "LP H" N[f]H(Ya)(Ya)Ya{Ya:C&!C=O&!C:Hev}
HBex AS_N3C3_ 3 1 2 NoDup 2.9 YES "LP" N[f](Ya)(Ya)Ya{Ya:C&!C=O&!C:Hev}
HBex AS_HN2C2_ 2 1 3 NoDup 3.0 YES "H LP" N[f]H=C
HBex AS_HN2C2_ 3 1 2 NoDup 3.0 YES "LP" N[f]=C-Any
HBex AS_HN2C2_ 2 1 4 NoDup 3.0 "" "H" N[f]H2Hev(:Hev):Hev
HBex AS_HN2C2_ 2 1 3 NoDup 3.0 "" "H" N[f]HHev(:Hev):Hev
HBex AS_HN2C2_ 1 2 3 NoDup 3.0 "" "H" HNC=Any
HBex AS_HNS3_ 6 5 2 NoDup 3.0 "" "H" AnyS(=O)(=O)N[f]H
HBex AS_HN4_ 2 1 3 NoDup 3.6 "" "C" N[f](Z)(Z)(Z)Z{Z:C&!C=O&!C:Hev}
hbex AS_HN2N2_ 3 2 1 NoDup 3.0 YES "LP" N:N[f]:N
hb AS_O3P_ 3 1 2 All 2.9 128 "0.0 180.0" P(-O)(-O)(-O)(-O)
hb AS_O3P_ 3 1 2 All 2.9 128 "0.0 180.0" P(-O)(-O)(-O)

APPENDIX "C"

EXPERIMENTAL DATA SETS

Data Set            No. of Cpds   Structure, Activity
 1 Uehling                9       camptothecin, DNA fragmentation
 2 Strupczewski          34       benzisoxazoles, ip Behavioral
 3 Siddiqi               10       adenosines, Brain A1 binding
 4 Garratt1              10       tryptamines, melanophore binding
 5 Garratt2              14       tryptamines, melanophore binding
 6 Heyl                  11       deltorphin, opioid receptor (DAMGO)
 7 Cristalli             32       adenosines, A2a agonists
 8 Stevenson              5       piperidines, NK1 antagonism
 9 Doherty                6       triarylbutenolides, endothelin-A antag.
10 Penning               13       SC-41930 analogs, LTB4 antagonism
11 Lewis                  7       oxazolinediones, NK1 binding
12 Krystek               30       sulfonamides, endothelin-A antagonism
13 Yokoyama1             13       oxamic acids, T3 binding
14 Yokoyama2             12       oxamic acids, T3 binding
15 Svensson              13       benzindoles, 5-HT1A agonism
16 Tsutsumi              13       peptidyl heterocycles, endopeptidase inhib.
17 Chang                 34       biphenyl sulfonamides, AT1 binding
18 Rosowsky              10       trimetrexate analogs, DHFR inhibition
19 Thompson               8       peptidomimetic, HIV-1 protease inhibition
20 Depreux               26       naphthylethyl amides, melatonin displ.

Literature References for Data Sets:

1. Uehling, D.E., Nanthakumar, S.S., Croom, D., Emerson, D.L., Leitner, P.P.,
Luzzio, M.J., et al. Synthesis, Topoisomerase I Inhibitory Activity, and in Vivo
Evaluation of 11-Azacamptothecin Analogs. J. Med. Chem. 1995, 38, 1106. (Table
2, with R = Et; IC50 data.)

2. Strupczewski, J.T., Bordeau, K.J., Chiang, Y., Glamkowski, E.J., Conway, P.G.,
et al. 3-[[(Aryloxy)alkyl]piperidinyl]-1,2-Benzisoxazoles as D2/5-HT2 Antagonists
with Potential Atypical Antipsychotic Activity: Antipsychotic Profile of Iloperidone
(HP873). J. Med. Chem. 1995, 38, 1119. (Tables 2 and 3 with n=3, X=O; ED50
for inhibition of apomorphine-induced climbing.)

3. Siddiqi, S.M., Jacobson, K.A., Esker, Olah, M.E., Ji, Xi-duo, et al.
Search for New Purine- and Ribose-Modified Adenosine Analogs as Selective
Agonists and Antagonists at Adenosine Receptors. J. Med. Chem. 1995, 38, 1174.
(Table 1, R1 = H; Ki(A1), values estimated from % displacement and stereoisomers
averaged as needed.)

4. Garratt, P.J., Jones, R., Tocher, D.A., Sugden, D. Mapping the Melatonin
Receptor. 3. Design and Synthesis of Melatonin Agonists and Antagonists Derived
from 2-Phenyltryptamines. J. Med. Chem. 1995, 38, 1132. (Table 1 and Table 2.)

5. Garratt, P.J., Jones, R., Tocher, D.A., Sugden, D. Mapping the Melatonin
Receptor. 3. Design and Synthesis of Melatonin Agonists and Antagonists Derived
from 2-Phenyltryptamines. J. Med. Chem. 1995, 38, 1132. (Table 1 and Table 2.)

6. Heyl, D.L., Dandabuthla, M., Kurtz, K.R., Mousigian, C. Opioid Receptor
Binding Requirements for the δ-Selective Peptide Deltorphin I: Phe3 Replacement
with Ring-Substituted and Heterocyclic Amino Acids. J. Med. Chem. 1995, 38,
1242. (Table 1; binding Ki to DAMGO.)

7. Cristalli, G., Camaioni, E., Vittori, S., Volpini, R., Borea, P.A., et al.
2-Aralkynyl and 2-Heteroalkynyl Derivatives of Adenosine-5'-N-ethyluronamide as
Selective A2a Adenosine Receptor Agonists. J. Med. Chem. 1995, 38, 1462.

8. Stevenson, G.I., MacLeod, A.M., Huscroft, I., Cascieri, M.A., Sadowski, S.,
Baker, R. 4,4-Disubstituted Piperidines: A New Class of NK1 Antagonist. J. Med.
Chem. 1995, 38, 1264. (Table 1.)

9. Doherty, A.M., Patt, W.C., Edmunds, J.J., Berryman, K.A., Reisdorph, B.R., et
al. Discovery of a Novel Series of Orally Active Non-Peptide Endothelin-A (ETA)
Receptor-Selective Antagonists. J. Med. Chem. 1995, 38, 1259. (Table 3; IC50
ETA.)

10. Penning, T.D., Djuric, S.W., Miyashiro, J.M., Yu, S., Snyder, J.P., et al.
Second-Generation Leukotriene B4 Receptor Antagonists Related to SC-41930:
Heterocyclic Replacement of the Methyl Ketone Pharmacophore. J. Med. Chem.
1995, 38, 858. (Table 1, all; LTB4 receptor binding.)

11. Lewis, R.T., MacLeod, A.M., Merchant, K.J., Kelleher, F., Sanderson, I., et al.
Tryptophan-Derived NK1 Antagonists: Conformationally Constrained Heterocyclic
Bioisosteres of the Ester Linkage. J. Med. Chem. 1995, 38, 923.

12. Krystek, S.R., Hunt, J.T., Stein, P.D., Stouch, T.R. 3D-QSAR of Sulfonamide
Endothelin Inhibitors. J. Med. Chem. 1995, 38, 659.

13. Yokoyama, N., Walker, G.N., Main, A.J., Stanton, J.L., Morrissey, M., et al.
Synthesis and SAR of Oxamic Acid and Acetic Acid Derivatives Related to
L-Thyronine. J. Med. Chem. 1995, 38, 695.

14. Yokoyama, N., Walker, G.N., Main, A.J., Stanton, J.L., Morrissey, M., et al.
Synthesis and SAR of Oxamic Acid and Acetic Acid Derivatives Related to
L-Thyronine. J. Med. Chem. 1995, 38, 695.

15. Haadsma-Svensson, S.R., Svensson, K., Duncan, N., Smith, M.W., Lin, Ch.-H.
C-9 and N-Substituted Analogs of cis-(3aR)-(-)-2,3,3a,4,5,9b-Hexahydro-3-propyl-
1H-benz[e]indole-9-carboxamide: 5HT1A Receptor Agonists with Various Degrees
of Metabolic Stability. J. Med. Chem. 1995, 38, 725.

16. Tsutsumi, S., Okonogi, T., Shibahara, S., Ohuchi, S., Hatsushiba, E., et al.
Synthesis and Structure Activity Relationships of Peptidyl α-Keto Heterocycles as
Novel Inhibitors of Prolyl Endopeptidase. J. Med. Chem. 1994, 37, 3492. (Table 2,
X = CH2CH2; IC50.)

17. Chang, L.L., Ashton, W.T., Flanagan, K.L., Chen, Ts.-Bau., O'Malley, S.S., et
al. Triazolinone Biphenylsulfonamides as Angiotensin II Receptor Antagonists with
High Affinity for Both the AT1 and AT2 Subtypes. J. Med. Chem. 1994, 37, 4464.
(Table 1, R' = (2-Cl)C6H4; AT1 [rabbit aorta] IC50.)

18. Rosowsky, A., Mota, C.E., Wright, J.E., Queener, S.F. 2,4-Diamino-5-
chloroquinazoline Analogs of Trimetrexate and Piritrexim: Synthesis and Antifolate
Activity. J. Med. Chem. 1994, 37, 4522. (Table 2; rat liver IC50.)

19. Thompson, S.K., Murthy, K.H.M., Zhao, B., Winborne, E., Green, D.W., et al.
Rational Design, Synthesis, and Crystallographic Analysis of a Hydroxyethylene-
Based HIV-1 Protease Inhibitor Containing a Heterocyclic P1'-P2' Amide Bond
Isostere. J. Med. Chem. 1994, 37, 3100. (Table 2, X = Boc; apparent Ki.)

20. Depreux, P., Lesieur, D., Mansour, H.A., Morgan, P., et al. Synthesis and
Structure-Activity Relationships of Novel Naphthalenic and Bioisosteric Related
Amidic Derivatives as Melatonin Receptor Ligands. J. Med. Chem. 1994, 37, 3231.

APPENDIX "D"

A list of 736 commercially available thiols broken down into 231 clusters based on
topomeric CoMFA field descriptors, along with the systematic name applicable to each. The
231 clusters are sorted by proposed name, first by the "root" structure, i.e., the fragment
attached immediately to the -SH, and then by the substitution pattern on that "root"
substructure. The names describe topologically equivalent hydrocarbons, i.e., structures in
which all monovalent atoms are replaced by hydrogens and the other atoms by carbons.
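
The reduction to a "topologically equivalent hydrocarbon" can be illustrated per atom. This is a minimal sketch only: the function names are hypothetical, and the set of elements treated as monovalent (H and the halogens F, Cl, Br, I) is an assumption, since the text does not enumerate them.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed list of monovalent elements; the source text does not list them. */
static int is_monovalent(const char *element)
{
    static const char *mono[] = { "H", "F", "Cl", "Br", "I" };
    size_t i;
    for (i = 0; i < sizeof mono / sizeof mono[0]; i++)
        if (strcmp(element, mono[i]) == 0) return 1;
    return 0;
}

/* Maps one atom to its counterpart in the equivalent hydrocarbon:
   monovalent atoms become hydrogen, every other atom becomes carbon. */
const char *hydrocarbon_equivalent(const char *element)
{
    assert(element != NULL);
    return is_monovalent(element) ? "H" : "C";
}
```

Applied atom by atom to a connection table, with the bonds left unchanged, this mapping gives the skeleton that the cluster names describe; for example, a chlorine substituent is named as a hydrogen and a ring nitrogen as a ring carbon.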

Cluster ID (a)    Cluster Size    Structural Root    Structural Substitution (a)


1 


26 


arvl 


Single 


1 AA 


1 


aryl 


2,3,5-Me 


1 *77 

±. f f 


1 


£iryl 


2 3 5-Me-4-Pr 


163^ 


1 


aryl 


J - 14-1^, 0-jrr J onet ) Dneto 


151 


1 


->«r1 

aryl 


2 , J- (4-BU7 onetO-D-Me 


33 


5 


aryl 


O ^ Dam MM 


80 


2 


aryl 


2 , 5 -Me 


192 


1 


aryl 


2 , D-Me-3-iPe 


7 


14 


aryl 


Z t 0 -won- J 1 4 / D / -Me 


27 


> 
6 


aryl 


2,o-NOn-J-Ar 


107 




aryl 


z- (z-oz) rneciC-4, 3-Denzo 


189 


1 


aryl 


2- (3 , 5 -Me) Ar-4 , D-Benzo 


141 


1 


aryl 


2- (4-Et ) PhePr 


205 


1 


aryl 


2- {4-Stilbenyl)Stilbenyx 


188 


1 


aryl 


2-5netCH2-4 > 5-Benzo 


56 


3 


aryl 


2-Ar 


1 1 o 


X 


B«-«»1 

aryl 


2-Ar"-j*3— we 


ion 
190 


1 


aryl 


2-Ar-4 , D- ( J , 4-Et. / Benzo 


41 


/r 
O 


aryl 


z-Ar— 4, 3-Benzo 


XDZ 


X 


aryl 


^— oz 




Q 


aryl 


Z — fcu 


O D 


z. 


aryl 


z— won- J — tc we 


lU b 


-J 


aryl 


z-rnetiC-4, D-Benzo 


/ / 


A 


aryl 


2-rnerr 




X 


aryl 


Z— Ko 


X 


2, 


aryl 


z — xi>enyx 






aryl 


4 — / ^ — Mo \ R<an Ti^ 

J , f» \ J we ^ oexizo 


^ X o 


X 


aryl 


j,4-ia,D) xnaenu 


X u *s 


1 


CI J. y X. 


j#4 — io»jlj, %o MX } xiiu6ii%j ^ — o — we 


7 o 




A T^/1 

oxy X 


j#4— io#x?# \c -we i xnoenvj ) 


99 


3 


aryl 


3.4- (a - b-NanKtho \ 


157 


1 


aryl 


3.4-Ar 


58- 


3 


aryl 


3,4-Benzo-5-Me 


XOO 


2 


aryl 


3 . 4-Benzo-6-tBu 


37 


5 


aryl 


3,5-Me 


180 


1 


aryl 


3- (2 , 3-Benzo-4-Et ) 5het 


199 


1 


aryl 


3 - (2 , 3 -Benzo- 5 -Me) 5het 


182 


1 


aryl 


3- (2-Me-3-5het-5-Et) 5het 


115 


2 


aryl 


3-(3-5het)5het 


193 


1 


aryl 


3-(3-Ar)5het-4-Me 


67 


3 


aryl 


3-Ar 


129 


2 


aryl 


3-Ar-4- (2-Me) 5hetCH2 


46 


4 


aryl 


3-Ar-5-Me 


155 


1 


aryl 


3-B2 


82 


2 


aryl 


3-B2-5, 6-Benzo 


10 


16 


aryl 


3 -Me 
70


3 


aryl 


73 


3 


aryl 


95 


2 


aryl 


88 


2 


aryl 


81 


2 


aryl 


48 


4 


aryl 


2 


23 


aryl 


92 


2 


aryl 


90 


4 


aryl 


19 


8 


aryl 


148C 


1 


aryl 


228 


1 


aryl 


12 


10 


5het 


50 


4 


5het 


139 


1 


5het 


89 


2 


5het 


173 


1 


5het 


69 


3 


5het 


198 


1 


Shet 


174 


1 


5het 


171 


1 


Shet 


170 


1 


5het 


123 


2 


5het 


22 


7 


Shet 


202 


1 


Shet 


122 


2 


Shet 


197 


1 


Shet 


6 


14 


Shet 


225 


1 


Shet 


224 


1 


Shet 


63 


3 


Shet 


178 


2 


Shet 


72 


3 


DTiec 


40 


5 




183 


1 


Dfiec 


64 


3 




105 


2 




160 


1 




146 


1 




203 


1 


^ilC u 


126 


z 






q 


Shet 


211C 


1 


Shet 


124 


2 


Shet 


28^ 


6 


Shet 


30 


6 


Shet 


204 


1 


Shet 


79 


2 


Shet 


78 


2 


Shet 


117 


2 


Shet 


186 


1 


Shet 


68 


3 


Shet 


112 


2 


Shet 



3-Naphth 
3-Pr-4-sBu-6-Me 

3- iPr 

4- Ar 
4-Bz 
4-Et 
4 -Me 
4-R9+ 
4-iBu 
6-NoH 

(adenosine) 

(fluorescein) 

Simple 

2.3-(a,b-Naphtho) 

2,3-5hetO-4-Me 

'2,3-Ar 

2-(2,S-Et)Ar-3-Et 

2- (2-Me) Ar-3- (2-Me) PheEt 

2- (2-Me)Ar-3-R10 

2-(2-sBu)-3-Et 

2- (3 , S-Me) Ar'-3-Shet 

2- (3 , S-Me) Bz-3 , 4 -Benzo 

2- (3-Et)Ar-3-Bz 

2"(4-Et)Ar 

2- (4-Et) Ar-4- (4-Me) Ar 
2- (4-iPr)Ar-3-Bz 
2-5hetCH2-3- (4-tBu) Ar 
2-Ar 

2-Ar-3- {2-Ar) ShetBu 

2 -Ar-3 - ( 2-Ar ) 5hetCH2 

2-Ar-3-(2-Bz)Ar 

2-Ar-3- {2-Me) Shet 

2-Ar-3-{3,4-Et)Bz 

2-Ar-3-(3-Ar)SHetEt 

2-Ar-3-(3-Ar) Phe Pr 

2-Ar-3- (3-Ar-5-Me) Shet 

2-Ar-3-(3-Me)Ar 

2-Ar-3-(4-Ar)<^^hx 

2 - Ar - 3 - { 4 - Ar ) CyhxCH2 

2-Ar-3-(4-PheEt)Ar 

2-Ar-3-(tBu)Ar 

2-Ar-3-Ar 

2 - Ar - 3 -Benzyl idene 

2-Ar-3-IndenCH2 

2 -Ar-3 -Me 

2-Ar-3-PhePr 

2-Ar-5- (4- (2 , 4-Me) Bz) Ar 

2-Bz 

2-BZ-3 , 4-Benzo 
2-Cyhx 

2-Cyhx-3, 4-iPe 
2-Et 

2-Et-3- {2-Me) PheEt 
1




5het 






5het 


ol 




5het 




X 


5het 




4 


Shet 


oo 


2 


5het 


Q 1 




Shet 


A 

4 


1 7 
X / 


Shet 


IT) 


1 

X 


Shet 


Jo 




Shet 


X3 


xu 


Shet 




X 


Shet 


oo 


•J 


Shet 


29 


c 
o 


Shet 


/I 




Shet 


108 


o 




127 






54 


J 




221 


X 




187 


•\ 




143 


X 


3 lie u. 


96 


4. 






X 


J lies ^ 


loy 


X 


Shet 


94 


£, 




210 


X 




3o 


1 3 




I/O 


X 




IOC 

19o 


X 




1 CQ 


1 




4^ 


A 
*4 




zuu 


X 


Shet 


X u 




Shet 






Shet 


191 




Shet 


145 




Shet 


114 


2 


Shet 


18 


8 


Shet 


59 


3 


Shet 


65 


3 


Shet 


24 


7 


Shet 


44 


6 


Shet 


52 


5 


Shet 


111 


2 


Shet 


153 


1 


Shet 


32^ 


6 


Shet 


223 




Shet 


185 




Shet 


34 




alkyl 


104 




al)cyl 


62 




alkyl 


3 


18 


alkyl 


14 


9 


alkyl 



2-Me-3 , 4- ( 3-Me) Benzo 

2-Me-3 ,4-Benzo 

2-Me-3- (2 , 3 . 4-Me) Shet 

2 -Me-3 - (2,3 -Benzo- 4 -Et ) Shet 

2-Me-3-(3-Ar)5het 

2 -Me-3 - {3-Ar) ShetPr 

2-Me-3- (3-Ar-5-Me) Shet 

2-Me-3-(3-Bz)Ar 

2-Me-3- (4-tBu) PheEt 

2-Me-3-5Het 

2-Me-3-Me 

2-Me-3-Pe 

2-Me-3 -PheEt 

2-Me-3-PhePr 

2-Me-3-R8+ 

-2-Me-S-Bu 

2-Pe-3-Ar 

2-Pr 

2-R12 

2-iBu-3,4-iPe 

2- iPe"3, 4-Benzo 
3,4-(2.4-Me)Benzo 
3,4- (3-Ar)Benzd . 
3,4-(3-Hx)Benzo 
3,4-(3-Pr)Benzo 
3,4-{a,b-Napththo) 
3 , 4 -Benzo 

3- (2,4-Me)Bz 
3-(3,5-Me)Ar 
3- {3-Ar) Shet 
3-(3-Bz)Ar 
3-{3-Me)PheEt 
3-{4-Me)Ar 
3-(4-tBu)Ar 
3-(Al-4-Et)PheEt 
3-{3-Ar) PhePr 
3-5hetCH2 

3-Ar 

3-Ar(2-thia) 
3-Bu 

3-Me-5-H 

3-Me-5-NoH 

3-Pe 

3 -PheEt 

3-PhePr 

3-Pr 

3-R13 

(chrysenO) 

Simple 

(3) {BD (Bl) 

(3-Me)PhePr 

(3:4) 

(3:4) (Al) 
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*> 
3 


axKyx 


i J : 4| \BX| 


226 


1 


axjcyx 


\%) lAX-l iA— touj n — LJ lux; 




4 


aXKyx 


\% } |LIX 1 |UX/ . 


35 


7 


axicyx 


1 *"we/ xrnetrj; 


168 


1 


axicyx 




47 


4 


axJcyx 




179 


1 


aXKyx 


^3/ \oX| 1 fc— \^"^Ai D^nej onet; 


103 


2 


axKyx 




76 


2 


alJcyX 


(Dj iCX) iCXJ 


83 


2 


aXlcyX 


(5) (C2) 


216 


1 


alkyx 


(!)} (C2) (D2| |D2) 


43 


8 


alkyl 


(3:0| (DX/BX/FX} 


5 


15 


axKyx 




ICO 


J. 


aXKyX 




140 


X 


axjcyx 


\ o / 1 r ""/ii 1 


loo 


X 


axKy X 


1 / 1 \ riO / \ r X i 




1 

J 


aXKy X 


\ / ; \ u J 7 1 J-/ J / 


20 / 


X 


axicyx 


/ Q 1 /p** \ 
I O 1 \V^J ) 


8 


li 


aXKyx 


I O . XX J 


206 


1 


alJcyi 








ax icy J. 


\ Xw 1 \DX| 1 CtX / 


136 


1 


alkyl 


<10) (C1XE5) (E2) 


20 


8 


alkyl 


(10+) (Bl) 


39 


7 


alkyl 


(11+) (Bl) 


154^ 


1 


alkyl 


(12) (A-PheEt) 


230 


i 


alkyl 


(12) (F6) (Fl) 


131 


2 


alkyl 


(12) (F6) (F6) 


15 


9 


alkyl 


(12 + ) 


137 


1 


alkyl 


(13) (E4) 


231 


1 


alkyl 


(A-Ar) (A-Ar)Bz 


229 


1 


alkyl 


(A-Bz) (A-Bz)PheEt 


184 


1 


alkyl 


(Al)PheEt 


227^ 


1 


alkyl 


(cholesterol) 


^ X *i 


\ 


alkyl 


( cryptate ) 




7 


alkyl 


PheBu 


74 


■J 


alkyl 


PheEt 


2 b'-' 


0 


axxyx 


PbePr 


±1 


xu 




Q 1 Tnn 1 A 


102 


2 


benzyl 


2, 4, 5-Me 


57 


3 


benzyl 


2, 4, 6-Me 


217 


2 


benzyl 


2- (3-(2-Et)Ar)Ar 


213 


1 


benzyl 


2-Et-3- (2, 3-Et-5-Me) Ar-5-Me 


212 


1 


benzyl 


2 -R8 -3 -Naphthyl - 4 , 5 -Benzo 


9 


13 


benzyl 


2/3-Me 


84 


2 


benzyl 


3, 4 -Benzo 


132 


2 


benzyl 


3,5-Me 


130 


2 


benzyl 


3- (4-Stilbenyl)Stilbenyl 


134 


2 


benzyl 


4^(3-Ar)Ar 


21 


7 


benzyl 


4-Et 


26^ 


6 


benzyl 


4 -Me 


156 


1 


benzyl 


4-PhePr 


201 


1 


benzyl 


4-tBu 


135 


2 


alkenyl 


Ar..(2-Et)Ar 
1


alkenyl 


Ar. . (4"B2)Ar 


116 


2 


alkenyl 


Ar..Ar 


133 


2 


alkenyl 


Ar . *Bz 


110 


2 


alkenyl 


Et.CN,CONH2 


87 


2 


alkenyl 


NH2.CN.N=NPh 


HQ 


2 


alkenyl 


P(NMe2)3. .Ar 


X A W 


2 


alkenyl 


P(Pr)3. .Ar 


118 


2 


alkenyl 


P{iPe)3. .Ar 


51 


4 


alkenyl 


PCyhx3..Ar 








PEt3. . (2-Bz)Ar 


31^ 


6 


* alkenyl 


XrbU^ • ♦ rVX. 


194 


1 


alkenyl 




109 


2 


alkenyl 




101 


2 


cyclohexyl 


Simple 


149 


1 


eye 1 ohexy 1 




55 


3 


cyclohexyl 


'2,3,4,5-iBu 


147 


1 


cyclohexyl 


2,3,4-iBu-5-iPe 


209 


1 


cyclohexyl 


2-(3, 4-PheEt) 5het-6-Me 


208 


1 


cyclohexyl 


2-Me-3,5-CMe2 


167 


1 


cyclohexyl 


.2-Me-4-sPe 


165 


1 


cycloheixyl 


2-iPr-3,5-Me 


150 


1 


cyclohexyl - 


3-sPe-6-Me 


161 


1 


cyclohexyl 


4-Et-4-iBu 


219 


1 


cyclohexyl 


(coitplex) 


175 


1 


eye 1 open tyl 


2-Ar-4-spiro 


215 


1 


eye 1 open tyl 


3-PhePr 



(a) To generate these names, all heteroatoms are first replaced by
carbon (to produce the simplest common topology) and a particular
structure is chosen from among these topologies as the "most typical"
of that cluster, if possible to contain the largest substructure that
distinguishes that cluster from all others.

Within the name of a substitution, numbers indicate positions when 
substitution is on a ring, but chain length when substitution is on a 
chain (numbers separated by a colon indicate a range of chain 
lengths). Also, within a chain, letters indicate a position of 
substitution. (For example, (C2) describes a two atom branching from 
the third position of a chain, while 3-PhePr describes a phenyl 
propyl skeleton attached to the 3-position of a ring.)

A dot notation (.) separates the three possible substituents on an
alkenyl root, the substituent order being same carbon as the -SH
substituent, then the position trans to the -SH, and finally cis to -SH.

The above notwithstanding, any name enclosed completely in
parentheses takes its usual structural meaning.

Here are structural descriptions for each name abbreviation in the
above table, mostly in SLN (SYBYL Line Notation), listed
alphabetically. (SLN extends SMILES with the following concepts,
among others. Hydrogens are explicit. Ring openings and closures
begin with a number enclosed by [] and end with the matching
number preceded by @. Other SLN symbols used in these SLN
definitions are: ~ = any bond; - = single bond (used here to provide a
reference for [R]); : = aromatic bond; ! = the SLN following (here in
parentheses) is not allowed; [F] = no additional atoms may be
attached to the preceding atom; [!R] = preceding bond may not be in
a ring; [R] = preceding bond must be in a ring.)

5het = 5Het = C[1]:C:C:C:C:@1
alkenyl = C=C
alkyl = C-[!R]C
aryl = Ar = Phe = Ph = C[1]:C:C:C:C:C@1
benzyl = Bz = HSC-[!R]C-[R]C
Bu = C-[!R]C-[!R]C-[!R]C-[!R]C
cyclohexyl = Cyhx = C[1](-|=)C~C~C~C~C~@1
cyclopentyl = C[1](-|=)C~C~C~C~@1
Et = C-[!R]C
inden = C[1]:C(-C-X-[2]):C(-@2):C:C@1
iBu = C-[!R]C-[!R]C(-[!R]C)-[!R]C
iPe = C-[!R]C-[!R]C-[!R]C(-[!R]C)-[!R]C
Me = C
naphth = C[1]:C(-C-X-[2]):C(-@2):C:C:C@1
NoH = !(CH)
O denotes ring fusion. CO benzo fuses a 6-membered aromatic ring.
Pe = C-[!R]C-[!R]C-[!R]C-[!R]C-[!R]C
Pr = C-[!R]C-[!R]C-[!R]C
R# = alkyl chain of approximate length #
Simple = !(C-[!R]C)
sPe = C(-[!R]C)-[!R]C-[!R]C-[!R]C
Stilbenyl = C=[!R]C-[!R]C[1]:C:C:C:C:C@1
tBu = C(-[!R]C)(-[!R]C)-[!R]C



The following replaces section E contained in the priority applications.
Not all of what was previously in E is included here, because the latest
versions of BUILD_3D etc. are provided separately in Section A.

The first phase of construction of a combinatorial library takes
as input a description of the chemical transformation represented
by that combinatorial library and a list of available reagents such
as the Available Chemical Directory (ACD), and produces as output
all the part structures (aka substructures or fragments), in product
form, found in the list of available reagents which are appropriate for
the chemical transformation, along with all structure-invariant
physicochemical properties of those fragments that might be useful in
diversity design (Optiverse) or searching (VL).

In the course of this process, data are recorded permanently into three
tables:

REACTIONS (a Molecular Spreadsheet) = information about a
   reaction scheme. Each record corresponds to a reaction,
   where PanLabs or the manager of the VL designates
   what is a reaction. A typical reaction would be:
   "reaction of each nitrogen of a diamine with various
   reagents such as acids (acylation) or ketones (reductive
   amination)".

REAGENTS (a Molecular Spreadsheet) = information about a
   particular set of reagents used in some instance of a
   reaction. Each record corresponds to a particular logical
   reagent structure search in a database such as the Available
   Chemical Directory, presumably a set of reagent structures
   which will all react in the same way. For example, there are
   sixteen reagent records for the diamine reaction, enumerating
   each of eight reactant classes that might react with
   each of the two nitrogens. One record, for example, describes a
   reaction with epoxides, which could be ring-opened nucleophilically
   (and regioselectively) by an amine to yield a beta-amino alcohol.

RDATA (an Oracle table) = invariant physicochemical data computed about
   compound fragments, typically the varying portions in a
   cSLN, with one record for each fragment encountered in ANY
   cSLN constructed. Thus data need not be recomputed when such
   fragments are re-encountered, a substantial saving in processing time.
   For example, records will be added describing the properties of
   a -CH2CH(OH)R chain (product fragment) for each (new) epoxide-R
   reagent retrieved by the example record just given for the REAGENTS
   spreadsheet.
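
The reuse of fragment data described for RDATA can be pictured with a
short sketch. It is written in Python rather than SPL, and everything in
it is an assumption made for illustration: the class name FragmentCache,
the placeholder property function, and the use of an in-memory dictionary
in place of the Oracle table.

```python
from typing import Callable, Dict, Tuple

# Hypothetical property tuple: (logP, molecular weight).
Props = Tuple[float, float]


class FragmentCache:
    """Keeps one property record per unique fragment SLN (stand-in for RDATA)."""

    def __init__(self, compute: Callable[[str], Props]) -> None:
        self._compute = compute
        self._rows: Dict[str, Props] = {}
        self.computed = 0  # number of times the expensive calculation ran

    def lookup(self, fragment_sln: str) -> Props:
        # Compute only on the first encounter; later encounters reuse the row.
        if fragment_sln not in self._rows:
            self._rows[fragment_sln] = self._compute(fragment_sln)
            self.computed += 1
        return self._rows[fragment_sln]


def fake_properties(sln: str) -> Props:
    # Placeholder only: a real system would calculate logP and MW here.
    return (0.0, float(len(sln)))


cache = FragmentCache(fake_properties)
cache.lookup("CH2CH(OH)CH3")
cache.lookup("CH2CH(OH)CH3")  # second request is served from the cache
```

In this sketch the counter `computed` stays at 1 after both lookups, which
is the saving the text attributes to RDATA.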

Entering a new reaction into the system involves adding a new row to
REACTIONS and at least two new rows to REAGENTS, by hand. This data
entry operation is the only required data entry in preparation for virtual
library production.

All other operations on these entities are carried out by the SPL script
getacd.core, executed within SYBYL. This script is reproduced below in its
entirety.

The major overall output of getacd.core is a set of files for a reaction,
whose base (file set) names are constructed by concatenating record numbers
from the REACTIONS and REAGENTS tables, and whose extensions are as follows:

.files = explicitly contains the names for all other files.

.csln = the template or prototype for the construction of a particular
cSLN. If there is more than one possible core for a particular reaction, their
structures and properties are recorded in the optional .cores file.

.X1, .X2, ... = a "hitlist" having an SLN with property contributions
for each unique fragment or variation at a particular position. Each
variation site has its own hitlist file.

.cores = similar to an .Xn file, but describes available variations
in a cSLN core. For example, the .cores file for the diamine reaction lists
SLNs of the cores and properties that each of the commercially available
diamines would contribute.
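
The naming scheme can be sketched as follows. The script listed later
builds base names with `%cat( R $1 V $2 R ... ID )`, which yields names of
the form R5V2R13. The Python below is only an illustration of that
pattern; the function names and the `n_sites` and `has_cores` parameters
are assumptions, not part of getacd.core.

```python
from typing import List


def fileset_base(class_id: int, variation: int, reagent_id: int) -> str:
    """Concatenate record numbers into a base name such as R5V2R13."""
    return f"R{class_id}V{variation}R{reagent_id}"


def fileset_names(base: str, n_sites: int, has_cores: bool = False) -> List[str]:
    """List the files expected for one file set (hypothetical helper)."""
    names = [base + ".files", base + ".csln"]
    # One hitlist file per variation site: .X1, .X2, ...
    names += [f"{base}.X{i}" for i in range(1, n_sites + 1)]
    if has_cores:
        names.append(base + ".cores")  # optional, only with several cores
    return names


base = fileset_base(5, 2, 13)
files = fileset_names(base, n_sites=2, has_cores=True)
```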

Two intermediate data tables are used in some of the operations of
getacd.core, as molecular spreadsheets:

HITS = results of a particular reagent search; also records
   information about supplier, catalog number, and price.
RSCRATCH = a "work table" used for calculation of side chain
   properties.

To aid in understanding the getacd.core SPL script which follows, here are
descriptions of the individual "columns" (aka attributes, fields) for each
of the tables introduced above.

REACTIONS:

NAME (text) For user recognition only

CLASS_ID (integer) A "global" identifier for a particular reaction scheme

VARIATION (integer) Can be more than one per CLASS_ID, intended to distinguish
   among different reaction conditions for a particular reaction. This
   value is the key linking REACTIONS and REAGENTS

NREAG (integer) Number of rows in REAGENTS for this reaction. Used only for
   checking self-consistency of user input.

CORE_SLN (text) The SLN of the core for this reaction, along with information
   needed by the cSLN builder to correctly attach side chains, or, especially,
   to correctly merge polyvalent variations with an invariant core.

Example of a record (diamine reaction, producing the R5V2Rn fileset),
broken into two lines for clarity:

         1            2          3           4
         NAME         CLASS_ID   VARIATION   NREAG
5 Row5   Piperazine   5          2           16

         5
         CORE_SLN
5 Row5   N[1](X1)CH2CH2N(X2)CH2CH2@1 2,X4R1=1;10,X2R1=9
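
NREAG exists only so that hand-entered data can be checked for
self-consistency. A minimal sketch of such a check is given here in
Python; the dictionary-based rows and the function name are assumptions
for illustration, whereas the script itself performs the comparison with
%tblsrch_val and %count on the Molecular Spreadsheets.

```python
from collections import Counter
from typing import Dict, List, Tuple


def nreag_mismatches(
    reactions: List[Dict], reagents: List[Dict]
) -> List[Tuple[int, int, int]]:
    """Return (VARIATION, declared NREAG, rows found) for inconsistent reactions."""
    found = Counter(row["VARIATION"] for row in reagents)
    problems = []
    for rx in reactions:
        actual = found.get(rx["VARIATION"], 0)
        if actual != rx["NREAG"]:
            problems.append((rx["VARIATION"], rx["NREAG"], actual))
    return problems


# The diamine record above declares sixteen reagent rows for variation 2.
reactions = [{"NAME": "Piperazine", "CLASS_ID": 5, "VARIATION": 2, "NREAG": 16}]
reagents = [{"ID": i, "VARIATION": 2} for i in range(1, 17)]
problems = nreag_mismatches(reactions, reagents)
```

With all sixteen rows present the returned list is empty; removing one
row would report the variation together with the declared and actual
counts.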



REAGENTS:

ID (integer) invariant identifier for this record

VARIATION (integer) link to REACTIONS by many-to-one relation

SEARCH_SLN (text) SLN for the reactive fragment, which any reactant
   molecule (e.g., within ACD) must contain in order to undergo
   the particular reaction

NOTLIST (text) combination of SLNs and files (of additional SLNs)
   for fragments that must NOT be contained within any reactant
   to be used in this reaction. (Reasons include interference
   with this or other reactions in the sequence, or toxicity
   to biological systems.)

PRUNE_SLN (text) similar and usually identical to SEARCH_SLN but
   may not contain any atoms or bonds of type "Any"; needed
   while processing the individual reagent to overcome some
   quirks in SLN processing within SYBYL.

SAME_AS (text) a hitlist file name. If present, this file's contents
   are used instead of performing an explicit reagent search.
   (For example, the list of acids that react with
   piperazine is identical for each of the two nitrogens.)

HOW (text) a series of structural modification commands which the
   script uses to convert a reactant structure into the corresponding
   atoms within the product. Atom ID references within these
   commands are either sequence numbers of that atom within the
   PRUNE_SLN, or names of atoms generated in a previous command.

Example: An isocyanate (PRUNE_SLN is CN=C=O) becomes most of a urea
(CNHC(=O)X1) when reacted with an amine. Here is the HOW for this
transformation:

BREAKB,2,3 ATYPE,2,N,am ATYPE,3,C,2 FILLV,2,H,A1 FILLV,3,H,A2 MARKX,A2

(reading left to right: the N=C becomes single; the N is made trivalent;
the C is made sp2; hydrogen named A1 is added to the N; hydrogen named A2
is added to the C; the A2 is marked as designating a "free valence" whenever
a cSLN is expanded.)

ATTACHED (text) the file extension for the output file of cSLN variations
   that this record produces.

TEMPLATE (text) for polyvalent variations only, information needed to
   build an aligned topomeric conformation, as follows:

   Argument 1: a file containing a pre-aligned 3D structure.
   Argument 2: the SLN of the template within the 3D structure produced
      by joining the reactant molecule to the pre-aligned structure.
   Argument 3: the name of an SPL macro that performs any additional
      structural operations needed to generate the topomeric conformation.
   Argument 4: any additional arguments to be passed to the macro named
      in argument 3.

   Example: aram.mol2 NH=CHCH2C(:Any):CH ACD!FIX_FUSE 10,11

VALENCES (integer) the number of valences within each of these variations.

FGPT_XTRA (text) for optimal fingerprint estimation, the SLN for any atoms
   that this particular record will ALWAYS add to the core. For example,
   FGPT_XTRA for the isocyanate acylation example in HOW is: C(=O)NHC
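
The HOW column described above is a space-separated series of commands,
each with comma-separated arguments. The sketch below shows one way such
a string could be split into commands. It is written in Python for
illustration only; the function name is an assumption, the command string
is the isocyanate example as read from the text, and the actual
interpretation of each command is done by the SPL function %acd_do_rxn in
the script.

```python
from typing import List, Tuple


def parse_how(how: str) -> List[Tuple[str, List[str]]]:
    """Split a HOW string into (command, argument list) pairs."""
    commands = []
    for token in how.split():
        op, *args = token.split(",")
        commands.append((op, args))
    return commands


how = "BREAKB,2,3 ATYPE,2,N,am ATYPE,3,C,2 FILLV,2,H,A1 FILLV,3,H,A2 MARKX,A2"
steps = parse_how(how)
for op, args in steps:
    # Arguments are atom sequence numbers in PRUNE_SLN or names such as A1, A2.
    print(op, args)
```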

EXAMPLE: Here is the record for the reductive amination reaction in which
a carbonyl (aldehyde or ketone only) is condensed with a primary or secondary
amine and then reduced to the amine with borohydride.

           1    2           3
           ID   VARIATION   SEARCH_SLN
13 ROW13   13   2           HcC(=O)C(-:Any)-:Any{Hc:H|C(-:Any)}

           4
           NOTLIST
13 ROW13   badIs.kaJ O=COtfl O=COH O=CC=O O=CAnyC=O \
           HcC(=O)C(-:Any)(-:Any).HcC(=O)C(-:Any)(-:Any){Hc:H|C(-:Any)-:Any} \
           NH(Hc)C(Any)Any{Hc:H|C(Any)Any}

           5                  6
           PRUNE_SLN          SAME_AS
13 ROW13   HcC(=O)C{Hc:H|C}

           7
           HOW
13 ROW13   BREAKB,2,3 DELA,3 ATYPE,2,C,3 FILLV,2,H,A1 MARKX,A1

           8          9          10
           ATTACHED   TEMPLATE   VALENCES
13 ROW13   X1

           11
           FGPT_XTRA
13 ROW13   CHC



RDATA:

CRC (NUMBER(10,0), primary key) a "cyclic redundancy code", used most
   often to verify the integrity of data communication packets, generated here
   from the SLN to enable fast exact substructure match searching of an Oracle
   table. (Rare ties in CRC values for non-identical SLNs are broken by appending
   <name=junk> to the duplicate-generating SLN and attempting to reregister
   until a unique CRC is generated.)

SLN (text) SLN of a fragment, open valence(s) at point(s) of attachment.

LOGP (NUMBER(6,2)) logP of the fragment, calculated for the structure
   where all open valences are filled with H's. A value of 99.99 denotes
   "could not be calculated".

MW (NUMBER(10,2)) molecular weight of the fragment exactly as described
   by the SLN. A value of -1.0 denotes "could not be calculated".

TOPOMERIC (text) a textual representation of the CoMFA steric field for
   the topomeric conformation of this molecule. (The 3D SLN of this conformation
   is written to a file in the fileset with extension .fal, for possible future
   reference.)

NROTBONDS (NUMBER(2,0)) number of bonds whose torsional values were set
   for this side chain during generation of the topomeric conformation.

PH_AS, PH_DL, PH_DS, PH_AL, PH_AR (NUMBER(2,0)) number of pharmacophoric
   points within this side chain, of different classes as defined by DISCO and
   SYBYL 6.3/Unity 2.6.
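
The CRC column and its tie-breaking rule can be sketched as follows. The
sketch is in Python and rests on several assumptions: zlib.crc32 stands in
for whatever CRC routine the SPL function %sln_to_crc uses, a dictionary
stands in for the Oracle table, and the suffix <name=DUPn> follows the
renaming seen in the script listing rather than the <name=junk> wording
above.

```python
import zlib
from typing import Callable, Dict


def default_crc(text: str) -> int:
    # Assumption: any deterministic 32-bit checksum serves for illustration.
    return zlib.crc32(text.encode("ascii"))


def register(
    table: Dict[int, str], sln: str, crc_fn: Callable[[str], int] = default_crc
) -> int:
    """Return the key holding sln, registering it under a unique CRC if new."""
    candidate = sln
    attempt = 0
    while True:
        crc = crc_fn(candidate)
        stored = table.get(crc)
        if stored is None:
            table[crc] = candidate  # first time this fragment is seen
            return crc
        if stored.split("<", 1)[0] == sln:
            return crc  # same structure already stored (ignoring <name=...>)
        # Different structure with the same CRC: rename and try again.
        attempt += 1
        candidate = f"{sln}<name=DUP{attempt}>"


rdata: Dict[int, str] = {}
key = register(rdata, "CH2CH(OH)CH3")
same = register(rdata, "CH2CH(OH)CH3")  # found again, nothing added
```

Passing a deliberately weak `crc_fn` (for example, the string length) is
a convenient way to exercise the collision branch.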

# following are definitions of oracle queries used for referencing table RDATA
# within SPL

RDBMS REFERENCE DEFINE oracle_rdata tripos oracle castor \
   MACHINE_ACCESS_INFO explicit_userid lawless explicit_password jlu816yl \
   RDBMS_ACCESS_INFO EXPLICIT_USERID adsvl explicit_password adsvl \
   DONE

RDBMS QUERY DEFINE RDATA_DATA oracle_rdata
select SLN,LOGP,MW,TOPOMERIC,NROTBONDS from RDATA where CRC=:NEW_CRC
#.
DONE






ALL THAT REMAINS IS GETACD.CORE AND CERTAIN FILES FROM 
CHOM_BATCH.CORE 



# There are only two important user entry points
# "optiv" for most purposes
# "cores" for building the .cores file (to be replaced)

@macro optiverse sybylbasic
# ===============================
# sets global state variables, then dispatches tasks in order
#
# $1 is a set of reaction IDs
# $2 is a set of modifiers (variations to be skipped, NoSearch, Test, )
#    TEST = only the first item in each hitlist is processed
#           (allows checking out all input data quickly)
#    DEBUG = uims ver on at all times
#    RONLY = only process specified rows in REAGENTS
#    NOSEARCH = skip search (hit lists must already exist in
#               working directory)
#    NOCAT = skip concatenation of Xn files
#    SEARCH = ONLY do search
#    BUILD = ONLY convert hitlists to Xn files
#    CSLN = ONLY build CSLN template
#    CORES = ONLY do core search and processing
# if numeric values - two interpretations
#    if RONLY, these are the ROW IDS in the REAGENT MSS to use
#    if not RONLY, these are VARIATIONS to NOT process

globalvar ACD!cmd ACD!db ACD!inited CHOM!Err ACD!Pool ACD!Xs \
   ACD!DoSearch ACD!Test ACD!Price ACD!Password \
   ACD!Preferred_supplier ACD!qprop \
   ACD!cost ACD!supplier ACD!FCD ACD!Only_Rs ACD!NoCAT ACD!Sites
localvar nrg rxrow rcrows v vars rxn dosearch dobuild docsln docores

setvar args2 %uppercase( "$2" )

setvar ACD!DoSearch %not( %set_and( NOSEARCH "$args2" ) )
setvar ACD!Test %set_and( TEST "$args2" )
setvar ACD!Price %not( %set_and( TEST "$args2" ) )
setvar ACD!NoCAT %set_and( NOCAT "$args2" )

# initialize other data if not done in a previous optiverse run
if %not( $ACD!inited )
   ACDinit
   if %not( $ACD!inited )
      return
   endif
endif

setvar ACD!only_rs
if %set_and( RONLY "$args2" )
   setvar ACD!only_rs $args2
endif

setvar dosearch TRUE
setvar dobuild TRUE
# next is obsolete
setvar docsln
setvar docores TRUE

setvar procs %set_and( SEARCH,BUILD,CORES,CSLN "$args2" )
if $procs
# if subprocess(es) specified set all false, only those specified on
   setvar dosearch %set_and( SEARCH "$args2" )
   setvar dobuild %set_and( BUILD "$args2" )
   setvar docsln %set_and( CSLN "$args2" )
   setvar docores %set_and( CORES "$args2" )
endif

for rxn in %set_unpack( $1 )
   setvar vars %tblsrch_val( REACTIONS CLASS_ID $rxn )
   if %not( $vars )
      %dialog_message( ERROR \
         "REACTIONS has no entry for a Class ID of: $rxn" "Bad REACTIONS Data" )
      return
   endif

   if %set_and( DEBUG "$args2" )
      uims ver on
   else
      uims ver off
   endif

   %file_delete( startup.pho ) >$nulldev
   photo on startup.pho >$nulldev

   setvar nv 1
   for v in $vars
# allow variations to be skipped
      if %or( "%not( $args2 )" "%not( %set_and( "$v" "$args2" ) )" )
         echo Variation $nv (ID: $v) of %count( $vars )
         setvar nv %math( $nv + 1 )

         TABLE DEFAULT REACTIONS
         setvar nrg %rcell( $v NREAG )
         setvar rcts %rcell( $v VARIATION )

         setvar rcrows %tblsrch_val( REAGENTS VARIATION $rcts )
         if %not( %eq( $nrg %count( $rcrows ) ) )
            %dialog_message( ERROR \
               "For Variation $v of Reaction $rxn,\
               REAGENTS has %count( $rcrows ) rows\
               but REACTIONS specifies $nrg reagents" \
               "Bad REACTIONS or REAGENTS Data" )
            return
         endif

         if $ACD!only_rs
            setvar svrows %set_unpack( "%set_and( %set_create( $rcrows ) \
               $ACD!only_rs )" )
            if %not( $svrows )
               echo No reactant classes to be searched or built for Reaction $rxn
            endif
         else
            setvar svrows $rcrows
         endif

         if $dosearch
            get_acd $rxn $rcts %set_create( $svrows )
         endif

         if $dobuild
            trsl_acd $rxn $rcts %set_create( $svrows )
         endif

         %file_delete( finish.pho ) >$nulldev
         photo on finish.pho >$nulldev

# CSLN file generation is obsolete
         if $docsln
            csln_files $rxn $rcts %set_create( $rcrows )
         endif

         if $docores
            cores $rxn $rcts %set_create( $rcrows )
         endif
      endif

      ACD!RxnUpdate $rxn $rcts
   endfor
endfor

uims ver off
photo off

#.

@macro get_acd sybylbasic
#
# do reagent searches in ACD for all specified rows in reagents

localvar fct rg sfrag buff bf hfname

TABLE DEFAULT REAGENTS
setvar rcrows %set_unpack( $3 )
for rg in $rcrows
   setvar sfrag %rcell( $rg SEARCH_SLN )
   setvar hfname %cat( R $1 V $2 R %rcell( $rg ID ) )
   setvar ofname %rcell( $rg SAME_AS )
   if %streql( "$ofname" )
      setvar ofname
   endif
   if $ofname
      setvar ofname %substr( $ofname 1 %math( %pos( "." $ofname ) - 1 ) )
   endif

   if %or( "$ACD!DoSearch" "%not( %file_exists( %cat( $hfname .hits ) ) )" )
      if %and( "$ofname" "%file_exists( %cat( $ofname .hits ) )" )
#        dcl /bin/cp %cat( $ofname .hits ) %cat( $hfname .hits )
      else
#        prepare notlist file
         setvar notf %open( %cat( $hfname .bad ) "w" )
         for not in %rcell( $rg NOTLIST )
            if %file_exists( $not )
#              write out all bad fragments NOT CONTAINED by SEARCH FRAGMENT
               setvar bf %open( $not "r" )
               while %not( %eof( $bf ) )
                  setvar buff %read( $bf )
                  if %and( "%not( %eof( $bf ) )" \
                     "%not( %streql( "%substr( "$buff" 1 1 )" ) )" )
#                    Any in $sfrag allows metals to fall through, so Any cannot exclude a frag
                     setvar notin %not( %search2d( $sfrag \
                        "$buff" NoTriv 1 y ) )
                     if %or( "$notin" "%and( "%not( $notin )" \
                        "%gt( %sln_atom_count( "$buff" ) 1 )" )" )
                        %write( $notf $buff ) >$nulldev
                     else
                        echo Not excluding $not fragment $buff (contained in $sfrag )
                     endif
                  endif
               endwhile
               %close( $bf )
            else
               %write( $notf $not ) >$nulldev
            endif
         endfor
         %close( $notf )

#        prepare query file
         setvar notf %open( %cat( $hfname .query ) "w" )
         %write( $notf $sfrag ) >$nulldev
         %close( $notf )

#        do search (first time for individual components, second time to filter
#        multicomponent hits retrieved)
         echo Searching for %rcell( $rg SEARCH_SLN )
         setvar dbs dcl $ACD!cmd -database $ACD!db -qfile \
            %cat( $hfname .query ) -notlist \
            %cat( $hfname .bad ) -hitlist tmp.hits -coords
         if $ACD!Test
            setvar dbs $dbs -maxhits 10
         endif
         $dbs

         setvar dbs dcl $ACD!cmd -database tmp.hits \
            -dbtype sln -qfile %cat( $hfname .query ) \
            -notlist %cat( $hfname .bad ) -hitlist %cat( $hfname .hits )
         $dbs
      endif
   endif
endfor

#.

@macro trsl_acd sybylbasic

# prepare Xn files, ensure properties are recorded for all side chains
globalvar ACD!CycFrag

localvar rcrows ma xls hfname how patin template xfile fname
localvar f f1 rg valences h nout XRgs allcrc crc

setvar rcrows %set_unpack( $3 )
setvar ma M1
setvar ACD!Xs
setvar ACD!Pool
setvar xls
setvar xs

TAILOR SET MAXIMIN2 MAXIMUM_ITERATIONS 1000 | |
setvar split_atms
setvar XRgs
setvar supp

ACD!INIT_Std_Topomer

# reset CRC uniqueness checker
%CRC_NOT_UNIQUE( junk junk ) >$nulldev

# for all reagents in this variant of this reaction
for rg in $rcrows
   TABLE DEFAULT REAGENTS
   setvar nout 0

   setvar hfname %cat( R $1 V $2 R %rcell( $rg ID ) )
   %file_delete( %cat( $hfname .pho ) ) >$nulldev
   photo on %cat( $hfname .pho ) >$nulldev

   setvar xfile %rcell( $rg ATTACHED )
   setvar ACD!Xs %set_or( "$ACD!Xs" $xfile )
   setvar XRgs[ $xfile ] $XRgs[ $xfile ] $rg

   setvar fname %cat( $hfname $xfile )
   setvar ofname %rcell( $rg SAME_AS )
   if %streql( "$ofname" )
      setvar ofname
   endif

   setvar do_copy
   if $ofname
      if %not( %streql( "$ofname" "H.X1" ) )
         setvar p %substr( $ofname %math( %pos( R \
            %substr( $ofname 2 ) ) + 2 ) \
            %math( %strlen( $ofname ) - %pos( "." $ofname ) ) )
         setvar rid %tblsrch_val( REAGENTS ID $p )
      endif
      setvar do_copy %file_exists( $ofname )
   endif

# may only need to copy a previous version of *.Xn, if it's there
   if $do_copy
   else
      setvar fgpt_xtra %rcell( $rg FGPT_XTRA )
      setvar uname %rcell( $rg USER_NAME )
      setvar falign %open( %cat( $hfname ".fal" ) "a" )
      setvar foracle %open( %cat( $hfname ".ora" ) "a" )

      setvar how %rcell( $rg HOW )
      if %not( $how )
         echo No HOW specified for row $rg in REACTANT table
         goto nxtreactant
      endif

      setvar ACD!FixGeom %pos( CLIP "$how" )
      setvar ACD!CycFrag

      setvar patin %rcell( $rg PRUNE_SLN )
      setvar valences %rcell( $rg VALENCES )
      for ats in %range( 1 $valences )
         setvar xls $xls %cat( X $ats )
      endfor

      setvar keep_ats
      setvar xats
      if %gt( %count( $patin ) 1 )
         setvar xats %arg( 2 $patin )
         setvar keep_ats %arg( 3 $patin )
         setvar patin %arg( 1 $patin )
      endif

      setvar template %rcell( $rg TEMPLATE )
      if $template
         setvar split_atms %arg( 4 $template )
         setvar CHOM!Align[ FIX_CF_CALLBACK ] %arg( 3 $template )
         setvar CHOM!Align[ SLN ] %arg( 2 $template )
         setvar template %arg( 1 $template )
         mol in m6 $template >$nulldev
         CHOM!INIT_BUILD_3D M5
      endif

      setvar f %open( $fname "w" )
      if $fgpt_xtra
         %write( $f # %cat( FGPT_X= $fgpt_xtra ) ) >$nulldev
      endif
      if $uname
         %write( $f # %cat( USER_NAME= $uname ) ) >$nulldev
      endif

      setvar f1
#     setvar f1 %open( %cat( $hfname ".base." $xfile ) "w" )
      echo Translating hits for %rcell( $rg SEARCH_SLN )

      if %set_and( HITS %set_create( %table_name() ) )
         echo Error - HITS table already exists!
         return
      endif

# read in the hitlist (it better be there!) and add price, FCD# columns
      TABLE CREATE hits unity $ma FROM_A_FILE \
         %cat( $hfname .hits ) 1 >$nulldev
      if %not( %set_and( HITS %set_create( %table_name() ) ) )
         echo No HITS exist for %rcell( $rg SEARCH_SLN ) !
      else
         if $ACD!Price
            table column_append rdbms tcd_price first price
            table column_append rdbms tcd_suppliers first supplier
            table eval new * PRICE,SUPPLIER
         endif

         setvar args %table( * ROW NUM )
         setvar wrote1
# processing all the hits
         for h in $args
            echo $h
            table default HITS
            setvar allsln %sln_get_sln_from_table( HITS $h )
# skip isotopically labelled reagents
            if %pos( "[I=" "$allsln" )
               echo Skipping isotopically labelled $allsln
               goto nxt_rxnb
            endif

            setvar pat %search2d( $allsln $patin NoTriv 1 y )

# break up compound SLN into molecular components
            setvar p %pos( "." $allsln )
            while $p
               setvar allsln %substr( "$allsln" 1 %math( $p - 1 ) ) \
                  %substr( "$allsln" %math( $p + 1 ) )
               setvar p %pos( "." "$allsln" )
            endwhile

# cycle through any components until we get the RELEVANT molecular component
            for cpsln in $allsln
               setvar cpsln %fix_acd( $cpsln )
               setvar pat %search2d( $cpsln $patin NoTriv 1 y )
               if $pat
                  setvar crc %sln_to_crc( $cpsln )
# check for within-hitlist duplicate of previous reagent providing same side chain
                  if %CRC_NOT_UNIQUE( $crc )
                     echo Duplicate reagent SLN skipped \
                        (supplier $supp) $cpsln
                     goto nxt_rxn
                  endif

                  DEFAULT $ma >$nulldev
                  %sln_to_mol( $ma $cpsln ) >$nulldev
                  if $ACD!FixGeom
                     if %not( %chom_concord( $ma ) )
                        goto nxt_rxn
                     endif
                  endif

                  setvar ats %acd_do_rxn( $ma $patin $how )
                  if %not( $ats )
                     goto nxt_rxn
                  endif

                  setvar nowsln %sln_labelx( $ma $xls )
                  setvar px %pos( X $nowsln )
# convert R's into X's (should probably be in C)
                  while $px
# check for isolated X's in ACD input
                     if %not( %set_and( "%substr( $nowsln \
                        %math( $px + 1 ) 1 )" 1,2,3,4,5,6,7,8,9 ) )
                        echo Input contains Isolated X - reactant discarded
                        goto nxt_rxn
                     endif
                     setvar nowsln %cat( %substr( $nowsln 1 %math( $px - 1 ) ) R \
                        %substr( $nowsln %math( $px + 1 ) ) )
                     setvar px %pos( X $nowsln )
                  endwhile

# must ensure that every Rx is unique
                  setvar rct 1
                  setvar px %pos( %cat( R $rct ) $nowsln )
                  while $px
                     setvar xs %set_or( "$xs" %cat( X $rct ) )
                     setvar py %pos( %cat( R $rct ) \
                        %substr( $nowsln %math( $px + 1 ) ) )
                     while $py
                        setvar nowsln %cat( %substr( $nowsln 1 \
                           %math( $py + $px ) ) %math( $rct + 1 ) \
                           %substr( $nowsln %math( $px + $py + 2 ) ) )
                        setvar py %pos( %cat( R $rct ) \
                           %substr( $nowsln %math( $px + $py ) ) )
                     endwhile
                     setvar rct %math( $rct + 1 )
                     setvar px %pos( %cat( R $rct ) $nowsln )
                  endwhile

# check again for within-hitlist duplicate of previous reagent providing same side chain
                  setvar crc %sln_to_crc( $nowsln )
                  if %CRC_NOT_UNIQUE( $crc )
                     echo Duplicate side chain SLN skipped: $nowsln
                     goto nxt_rxn
                  endif

                  if $ACD!Price
                     setvar nowsln %cat( $nowsln "<FCD=" %table( $h ROW NAME ) \
                        ";PRICE=" %rcell( $h PRICE ) ";SUPPLIER=" )
# identify any preferred supplier present
                     setvar supp %uppercase( %rcell( $h SUPPLIER ) )
                     setvar supp %ACD_Get_Preferred_Supplier( $supp )
                     if %not( $supp )
                        setvar nowsln %cat( $nowsln )
                     else
                        setvar nowsln %cat( $nowsln $supp )
                     endif
                  else
                     setvar nowsln %cat( $nowsln "<FCD=" %table( $h ROW NAME ) )
                  endif

# we have our SLN, now need to go off to RDATA to retrieve (find or generate) properties
                  copy $ma M2
                  default M2 >$nulldev
# generate fragment(s) for identity search
# NOTE - removal of reagent atoms may have split reagent up into independent fragments
                  remove atom %set_create( %atoms( $xs ) ) >$nulldev
                  setvar fsln %sln( M2 UNIQUE )
                  setvar p %pos( "<" $fsln )
                  if $p
                     setvar fsln %substr( $fsln 1 %math( $p - 1 ) )
                  endif
                  setvar p %pos( "." $fsln )
                  while $p
                     setvar fsln %substr( "$fsln" 1 %math( $p - 1 ) ) \
                        %substr( "$fsln" %math( $p + 1 ) )
                     setvar p %pos( "." "$fsln" )
                  endwhile

# because there may be multiple fragments per reactant, must sum over
# these to get property values
                  setvar tlogp 0
                  setvar tmw 0
                  setvar trb 0
                  setvar tcmf
# cycle through 1 or more fragments
# for each, search Oracle table via CRC for a previous occurrence
                  for sln in $fsln
# check for multiple binding of THIS fragment
                     setvar ACD!cycfrag
                     if %gt( $valences 1 )
                        %sln_to_mol( M4 $sln ) >$nulldev
                        default M4 >$nulldev
                        setvar nat %mol_info( M4 NATOMS )
                        FILLVALENCE * H 1 1.09 1 1^1 1.09 >$nulldev
                        setvar ACD!cycfrag %gt( %math( \
                           %mol_info( M4 NATOMS ) - $nat ) 1 )
                     endif

# if a fragment closes a ring, must use the input conformation
                     if $ACD!cycfrag
# identify the atoms to be extracted
                        setvar cycpat %search2d( %sln( $ma ) $sln NoDup 1 y )
                        setvar extract %set_create( %sln_rgroup_sybid( \
                           $ma $cycpat %range( 1 %sln_atom_count( $sln ) ) ) )
                        EXTRACT %cat( $ma $extract ) M4 >$nulldev
                        if %not( $ACD!FixGeom )
                           echo WARNING: Side Chains are joined \
                              in a reactant $allsln but CLIP is not in HOW
                        endif
                     else
                        %sln_to_mol( M4 $sln ) >$nulldev
                     endif

                     setvar sln %sln( M4 UNIQUE )
                     setvar ct 0
sln_modified:
                     setvar crc %sln_to_crc( $sln )
# find RDATA record - have properties already if present
                     if %not( %streql( %RDBMS_SetBindValue( \
                        $ACD!qprop NEW_CRC $crc ) TRUE ) )
                        echo RDBMS Set Bind Value failed - quitting
                        return
                     endif

                     setvar have1
                     setvar matches %RDBMS_BindQuery( $ACD!qprop )
                     setvar EOQ
                     while %not( $EOQ )
                        setvar rdata %RDBMS_ReadQuery( $ACD!qprop \
                           "%s %f %f %s %r" )
                        if %RDBMS_Error()
                           setvar EOQ true
                        else
# trim previously stored SLN of any <name= before checking for string match
                           setvar sln_noname %arg( 1 $rdata )
                           setvar p %pos( "<" $sln_noname )
                           if $p
                              setvar sln_noname %substr( \
                                 $sln_noname 1 %math( $p - 1 ) )
                           endif
                           if %streql( $sln $sln_noname )
                              setvar have1 TRUE
                              break
                           else
                              echo Different structures have same CRC's - renaming
                              setvar p %pos( "<" $sln )
                              if $p
                                 setvar sln \
                                    %substr( $sln 1 %math( $p - 1 ) )
                              endif
                              setvar ct %math( $ct + 1 )
                              setvar sln %cat( $sln "<name=DUP" $ct ">" )
                              goto sln_modified
                           endif
                        endif
                     endwhile

# if fragment not in Oracle table, calculate, then store, fragment properties
                     if %not( $have1 )
                        echo Adding $sln to RDATA
                        if %streql( SH $sln )
                           goto nxt_rxn
                        endif

                        setvar vals %ACDcalcprop( $sln $ma \
                           $valences $falign $split_atms )
                        if %not( $vals )
                           echo Physical data not calculable for $sln
                           goto nxt_rxn
                        endif

                        setvar rdata $sln $vals %set_size( "$CHOM!Align[ RBDS ]" )
                        if %not( %rdbms_transactionstart( oracle_rdata ) )
                           echo RDBMS_TRANSACTIONSTART failed - quitting
                           return
                        endif

# building SQL command to do Oracle INSERT
                        setvar cmd %cat( $crc $sln $valences \
                           %arg( 1 $vals ) "/ %arg( 2 $vals ) \
                           %arg( 3 $vals ) %set_size( \
                           "$CHOM!Align[ RBDS ]" ) \
                           ",NULL,NULL,NULL,NULL,NULL)" )

                        setvar cmd insert into RDATA VALUES $cmd ;
                        if %not( %rdbms_transactionCommand( oracle_rdata "$cmd" ) )
                           echo Addition of side chain to Oracle RDATA table failed - Quitting
                           return
                        endif

                        if %not( %rdbms_transactioncommit( oracle_rdata ) )
                           echo Transaction Commit failed - quitting
                           return
                        endif
                     endif

# accumulate Logp, MW, rotatable bonds - if any is NULL, overall value is NULL
                     setvar tlogp %ACD_Add( $tlogp %arg( 2 $rdata ) 99.99 )
                     setvar tmw %ACD_Add( $tmw %arg( 3 $rdata ) -1.0 )
                     setvar trb %ACD_Add( $trb %arg( 5 $rdata ) -1.0 )
                     if %and( "%not( %streql( "$tcmf" NULL ) )" \
                        "%not( %streql( "%arg( 4 $rdata )" NULL ) )" )
                        setvar tcmf %cat( $tcmf %arg( 4 $rdata ) )
                     else
                        setvar tcmf NULL
                     endif
                  endfor

# finished checking all fragments within a reagent from HITS
# output side chain structure for CSLN construction on 1st pass only
                  if $f1
                     %write( $f1 %substr( $nowsln 1 %math( %pos( \
                        "<" $nowsln ) - 1 ) ) ) >$nulldev
                     %close( $f1 ) >$nulldev
                     setvar f1
                  endif

# keep building output string for .Xn file. Null values are represented by blanks
                  setvar ACD!SLN $nowsln
                  ACD!addval MW -1.0 $tmw
                  ACD!addval RBD -1 $trb
                  ACD!addval LOGP 99.99 $tlogp
                  ACD!addval CTOPS NULL $tcmf STRCMP

                  setvar nowsln %cat( $ACD!sln ">" )
                  %write( $f $nowsln ) >$nulldev
                  setvar nout %math( $nout + 1 )
                  setvar wrote1 TRUE

# write out data for future Oracle table matching RDATA to its uses in CSLN libraries
                  if %not( $supp )
                     setvar supp NULL
                  endif

                  TABLE DEFA HITS
                  setvar price %rcell( $h PRICE )
                  if %not( $price )
                     setvar price NULL
                  endif

                  %write( $foracle $crc $1 $2 $rg %table( $h ROW NAME ) \
                     $PRICE $supp ) >$nulldev

# only record first occurrence of a component containing the fragment
                  break
               endif
nxt_rxn:
            endfor

            if %and( "$wrote1" "$ACD!Test" )
               break
            endif
nxt_rxnb:
         endfor

# finished all HITS !!
         if $template
            ACD!INIT_Std_Topomer
            setvar template
         endif

         %close( $falign )
         %close( $foracle )
         %close( $f ) >$nulldev
         ACD!record REAGENTS $fname $rg VARIANTS UPDATED
         TABLE CLOSE hits NO >$nulldev
         echo $nout variations written to $fname
      endif
nxtreactant:
      photo off
   endif
endfor

# %rdbms_close( oracle_RDATA ) >$nulldev
#.



@macro record ACD

# count how many variations are referenced by the new CSLN
if %not( $ACD!Test )
   TABLE DEFAULT $1
   dcl "wc $2 > junk.txt"
   setvar f %open( junk.txt r )
   setvar buff %read( $f )
   echo %wcell( $3 $4 %arg( 1 $buff ) ) >$nulldev
   echo %wcell( $3 $5 "%time()" ) >$nulldev
   %close( $f ) >$nulldev
   setvar f %table_attribute( FILENAME )
   echo SAVING $f
   %file_delete( $f ) >$nulldev
   TABLE SAVE $f
endif

#.



@macro RxnUpdate ACD
# count and save how many products
if %not( $ACD!Test )
   setvar nprod 1
   setvar xrg %tblsrch_val( REACTIONS CLASS_ID $1 )
   if $xrg
      if %eq( 1 %rcell( $xrg MORE_CORES ) )
         table default cores
         setvar nprod %rcell( %tblsrch_val( \
            CORES CLASS_ID $1 ) VARIANTS )
         if %not( $nprod )
            echo No VARIANTS value for CORES file for CLASS_ID $1
            return
         endif
      endif
   endif

   TABLE DEFAULT REAGENTS
   setvar ACD!Xs
   setvar XRgs
   for rg in %tblsrch_val( REAGENTS VARIATION $2 )
      setvar x %rcell( $rg ATTACHED )
      setvar ACD!Xs %set_or( "$ACD!Xs" $x )
      setvar XRgs[ $x ] $XRgs[ $x ] $rg
   endfor

   for x in %set_unpack( $ACD!Xs )
      setvar nvar 0
      for var in $XRgs[ $x ]
         setvar nxvar %rcell( $var VARIANTS )
         if %or( "%not( $nxvar )" "%lt( "$nxvar" 1 )" )
            setvar nxvar %rcell( $var SAME_AS )
            if %streql( "$nxvar" )
               setvar nxvar
            endif
            if %not( $nxvar )
               echo No variants value found or \
                  derivable for ID $var in REACTANTS
               return
            endif
            if %streql( "$nxvar" "H.X1" )
               setvar nxvar 1
            else
               setvar rg %pos( R %substr( $nxvar 2 ) )
               setvar rg %substr( $nxvar %math( $rg + 2 ) \
                  %math( %pos( $nxvar ) - $rg - 2 ) )
               setvar nxvar %rcell( %tblsrch_val( REAGENTS ID $rg ) \
                  VARIANTS )
               if %not( $nxvar )
                  echo No variants value found or derivable \
                     for ID $nxvar in REACTANTS
                  return
               endif
            endif
         endif
         setvar nvar %math( $nvar + $nxvar )
      endfor
      setvar nprod %math( $nprod * $nvar )
   endfor

   TABLE DEFAULT REACTIONS
   echo Generated $nprod products
   echo %wcell( $xrg SIZE $nprod ) >$nulldev
   echo %wcell( $xrg UPDATED "%time()" ) >$nulldev
   setvar f %table_attribute( FILENAME )
   echo SAVING $f
   %file_delete( $f ) >$nulldev
   TABLE SAVE $f
endif

#.
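The product count that RxnUpdate stores in the SIZE column is the number of core variants multiplied, site by site, by the summed VARIANTS values of the reagent rows attached at each site. The arithmetic can be sketched in C as follows. This is only an illustration: `count_products` and its array arguments are hypothetical stand-ins for the REAGENTS and CORES table lookups and are not part of the listing.

```c
#include <stddef.h>

/* Hypothetical helper, not from the listing.
 * n_cores       - number of core variants (1 when the reaction has no extra cores)
 * site_counts   - site_counts[s][k] is the VARIANTS value of the k-th reagent
 *                 row attached at site s
 * rows_per_site - number of reagent rows at each site
 * n_sites       - number of attachment sites (X1, X2, ...)                    */
long count_products(long n_cores, const long *const *site_counts,
                    const size_t *rows_per_site, size_t n_sites)
{
    long nprod = n_cores;
    for (size_t s = 0; s < n_sites; s++) {
        long nvar = 0;                       /* variants available at this site */
        for (size_t k = 0; k < rows_per_site[s]; k++)
            nvar += site_counts[s][k];
        nprod *= nvar;                       /* sites combine multiplicatively */
    }
    return nprod;
}
```

For example, 3 cores with 10 + 5 variants at the first site and 4 at the second give 3 * 15 * 4 products.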

@macro delval ACD
#
# removes all instances of an attribute/value pair from an SLN
globalvar ACD!SLN
localvar p p1

setvar p %pos( $1 $ACD!SLN )
while $p
   setvar p1 %pos( %substr( $ACD!SLN $p ) )
   if %not( $p1 )
      setvar p1 %pos( ">" %substr( $ACD!SLN $p ) )
   endif
   setvar ACD!SLN %cat( %substr( "$ACD!SLN" 1 %math( $p - 1 ) ) \
      %substr( "$ACD!SLN" %math( $p1 + $p ) ) )
   setvar p %pos( $1 $ACD!SLN )
endwhile

@macro addval ACD
#
# appends attribute value pair to ACD!SLN in UNITY format, checking
# for input values which simulate null values
globalvar ACD!SLN
localvar isnull

# first remove all existing references/data
ACD!delval $1
setvar ACD!SLN %cat( $ACD!SLN $1 )
if %eq( $# 4 )
   setvar isnull %streql( $2 $3 )
else
   setvar isnull %eq( $2 $3 )
endif
if $isnull
   setvar ACD!SLN %cat( $ACD!SLN )
else
   setvar ACD!SLN %cat( $ACD!SLN $3 )
endif

#.

@expression_generator ACD_Add
#
# adds a new value and returns sum, or returns the supplied code for NIL
# if either old or new value already codes for NIL

# need to truncate values retrieved from Oracle DB
setvar arg2 $2
setvar p %pos( $arg2 )
if $p
   if %gt( %strlen( $arg2 ) %math( $p + 2 ) )
      setvar arg2 %substr( $arg2 1 %math( $p + 2 ) )
   endif
endif

if %streql( $arg2 $3 )
   %return( $3 )
else
   %return( %math( $arg2 + $1 ) )
endif
return
#.

@expression_generator ACD_Get_Preferred_Supplier
#
# identify "best" supplier, edit name as needed
localvar p prefs supp

setvar prefs %set_and( "$1" $ACD!Preferred_Supplier )
if $prefs
   # if ANY suppliers are preferred, pick the best
   for p in %set_unpack( $ACD!Preferred_Supplier )
      setvar supp %set_and( $p $prefs )
      if $supp
         break
      endif
   endfor
else
   # else just grab the first one
   setvar supp %arg( 1 %set_unpack( "$1" ) )
   if %streql( "$supp" )
      setvar supp
   endif
endif

# can't tolerate hyphens
setvar p %pos( "-" "$supp" )
if $p
   setvar supp %cat( %substr( $supp 1 %math( $p - 1 ) ) \
      %substr( $supp %math( $p + 1 ) ) )
endif

%return( $supp )
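The supplier selection above amounts to a priority scan: walk the preferred list in order, return the first entry that the compound's supplier set also contains, fall back to the first listed supplier, then drop a hyphen from the chosen name. A minimal C sketch of that logic follows. The function names and the string-array representation of the sets are assumptions made for illustration, and the supplier names in the usage note are examples only.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return the first entry of `preferred` (highest
 * priority first) that also appears in `offered`; otherwise the first
 * offered supplier; NULL when nothing is offered. */
const char *pick_supplier(const char *const *offered, size_t n_offered,
                          const char *const *preferred, size_t n_preferred)
{
    for (size_t p = 0; p < n_preferred; p++)
        for (size_t o = 0; o < n_offered; o++)
            if (strcmp(preferred[p], offered[o]) == 0)
                return offered[o];
    return n_offered > 0 ? offered[0] : NULL;
}

/* Remove the first hyphen in place, e.g. "TCI-US" becomes "TCIUS". */
void strip_first_hyphen(char *name)
{
    char *h = strchr(name, '-');
    if (h != NULL)
        memmove(h, h + 1, strlen(h + 1) + 1);   /* shift tail and NUL left */
}
```

With a preferred list of ALDRICH, SIGMA, TCI-US and an offered set of ACROS, TCI-US, SIGMA, the scan returns SIGMA because it ranks higher in the preferred list.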



@expression_generator ACD_core_props
#
# generate physicochemical data
table default RSCRATCH
echo %wcell( 1 1 %sln( M1 ) ) >$nulldev
TABLE EVAL ALL 1 MW

# note that Xn each have an "AW" of 12.011 - back these out!
setvar mw %math( %rcell( 1 3 ) - %count( $* ) * 12.011 )

# replace Xn by Me groups for best LogP estimate
setvar sln %sln( M1 )
setvar p %pos( "X" $sln )
while $p
   setvar sln %cat( %substr( $sln 1 %math( $p - 1 ) ) \
      CH3 %substr( $sln %math( $p + 2 ) ) )
   setvar p %pos( "X" $sln )
endwhile

echo %wcell( 1 1 $sln ) >$nulldev
table eval all 1 CLOGP >$nulldev
setvar logp %rcell( 1 2 )
if %not( $logp )
   echo LogP not calculated for $sln
   setvar logp 99.99
endif

%return( "$mw $logp" )

#.



@expression_generator SybID2SLN
# returns the (first) atom in the SLN that corresponds to a SYBYL ID
setvar targ %arg( 1 %set_unpack( $3 ) )
for i in %range( 1 %mol_info( $1 NATOMS ) )
   if %eq( $targ %arg( 1 %set_unpack( %sln_rgroup_sybid( $1 $2 $i ) ) ) )
      %return( $i )
      return
   endif
endfor

@macro ACDinit sybylbasic
#
# read in MSS's, initiate database location and dbsearch engine
globalvar ACD!cmd ACD!db CHOM!Align ACD!inited ACD!SLNin ACD!SLNout

setvar ACD!db /common3/lawless/acd/acd_udb
# other one is /ads/lawless/ACD
setvar ACD!cmd /homeS/jilek/bin/dbsearch.ads
set CGQ_timeout 0
setvar TA_RDBMS_READ_TIMEOUT 50000

# odd bond types get created, later overridden by Concord
table recall reactions
table recall reagents
table recall cores

# Oracle setup
take /home8/lawless/tcd >$nulldev
take /tmp_mnt/net/sn/home4/cramer/panlabs/synplan/rdata >$nulldev
if %not( %RDBMS_Open( oracle_rdata ) )
   echo could not open Oracle table: RDATA with side chain data
   return
endif

setvar ACD!qprop %RDBMS_SetupQuery( oracle_rdata RDATA_DATA )
if %not( $ACD!qprop )
   echo RDATA query could not be Setup
   return
endif

if $ACD!Price
   if %not( %rdbms_open( oracle_tcd ) )
      echo ACD Price Oracle table not opened
      return
   endif
endif

ACD!INIT_TOPOMER

setvar ACD!SLNin N[+1](=O)(O[-1]) N[+1](=O)O[-1]
setvar ACD!SLNout N(=O)(=O) N(=O)=O
setvar ACD!Preferred_supplier \
   ALDRICH,SIGMA,FLUKA,LANCASTER,TCI-US,TRANSWLD,JANSSEN

setvar ACD!inited TRUE



@macro INIT_TOPOMER ACD
# initializes topomer calculations
globalvar ACD!TopInited ACD!Sites

if %not( $ACD!TopInited )
   table recall %cat( $DSERV_TB RSCRATCH ) m3 >$nulldev
   table CONF SLN
   setvar CHOM!Align[ DEBUG ]
   setvar CHOM!Align[ BUMPS ]
   setvar CHOM!Align[ ALICYC ] All_trans
   setvar CHOM!Align[ CHARGE ] None
   setvar CHOM!Align[ MCORE ] M6
   setvar CHOM!Align[ ORIENT ]
   setvar CHOM!Align[ FITRMS ] 0.6
   setvar CHOM!Align[ ATTACHED ]
   setvar CHOM!Align[ CORE_SLN ]
   uims load %cat( $DSERV_TB chom_batch.core ) >$nulldev

   ACD!INIT_STD_TOPOMER
   set CGQ_timeout 0
   setvar ACD!Sites[FILE] $TA_DEMO/disco_file.dat
   setvar ACD!Sites[FILE] /view/sybBDFR4K/vob/src/sybyl/demo/disco_file.dat

   param modi >$nulldev atom_def F F 4 TH F 9 1.30 GREEN 0.0 \
      4.0 N N 3 12.63 18 16 F { |
   parameter add bond_^ C.3 O.2 1 NO O.3 C.2 1 NO N.ar H 1 \
      NO S.o2 N.3 1 NO S.o2 N.2 1 NO S.o2 N.pl3 1 NO \
      N.1 H 1 NO S.o2 S.3 1 NO \ | >$nulldev
   parameter add bond_length C.3 O.2 1 1.5 O.3 C.2 1 1.5 \
      N.ar H 1 1.0 S.o2 N.3 1 1.5 S.o2 N.2 1 1.5 S.o2 \
      N.pl3 1 1.5 N.1 H 1 1.0 S.o2 S.3 1 1.6 j | >$nulldev
endif

setvar ACD!TopInited TRUE
#.

@macro INIT_STD_TOPOMER ACD
#
# (re)sets standard topomer template info after a TEMPLATE was supplied by a
# REAGENT
#
setvar CHOM!Align[ SLN ] NH=CHCH2Any
mol in m6 %cat( $DSERV_TB amidine.mol2 ) >$nulldev
setvar CHOM!Align[FIX_CF_CALLBACK] ACD!AMID_TORS
CHOM!INIT_BUILD_3D M5
#.

@expression_generator tblsrch_val
# ================================
#
# performs a search by value within some column of an MSS,
# returns space separated row IDs
localvar rows

table default $1
if %eq( $# 3 )
   setvar rows %table( %cat( "{RANGEC(" $2 "," \
      %math( $3 - 0.0001 ) "," %math( $3 + 0.0001 ) ")}" ) ROW NUM )
else
   setvar rows %table( %cat( "{RANGEC(" $2 "," $3 "," $4 ")}" ) ROW NUM )
endif

%return( "$rows" )
return

#.

@expression_generator ACD_DO_RXN
# $1 = molecule area, $2 = SLN pattern, $3 and following are transformations
# which convert the reactant in $1 to its product form,
# attachment point atoms being named by Xn
# returns TRUE if all went well
globalvar ACD!recnos
localvar ma sln tsf recno ats atm fdl nx

setvar ma $1
DEFAULT $ma >$nulldev
setvar sln $2
shift
shift

setvar pat %search2d( %sln( $ma ) $sln NoDup 1 y )
if %not( %eq( 1 %count( $pat ) ) )
   echo ACD_DO_RXN: $sln not found in %sln( $ma )
   return
endif

# set up mapping of SLN IDs to invariant RECNO's
setvar ats %sln_rgroup_sybid( $ma $pat %range( 1 %sln_atom_count( $sln ) ) )
for atm in %range( 1 %sln_atom_count( $sln ) )
   setvar anow %arg( 1 %set_unpack( %arg( $atm $ats ) ) )
   setvar ACD!recno[ $atm ] %atom_info( $anow RECNO )
endfor

setvar nx 0
# execute reaction, step-by-step
for tsfm in $*
   setvar tsfm %set_unpack( $tsfm )
   switch %uppercase( %arg( 1 $tsfm ) )
   case ATYPE)
      modify atom type %recno_to_id( $ma $ACD!recno[ %arg( 2 $tsfm ) ] ) \
         %arg( 3 $tsfm ) 1 1.5 1 1.5 1 1.5 1 1.5 >$nulldev
      ;;
   case BREAKB)
      setvar a1 %recno_to_id( $ma $ACD!recno[ %arg( 2 $tsfm ) ] )
      setvar a2 %recno_to_id( $ma $ACD!recno[ %arg( 3 $tsfm ) ] )
      setvar bond %bonds( %cat( $a1 $a2 ) )
      if $bond
         switch %bond_info( $bond TYPE )
         case 1)
         case am)
            remove bond $bond >$nulldev
            ;;
         case 2)
         case ar)
            modify bond type $bond 1 >$nulldev
            ;;
         endswitch
      else
         echo ACD_DO_RXN: $tsfm but no bond exists
         return
      endif
      ;;
   case SPLIT)
      setvar a1 %recno_to_id( $ma $ACD!recno[ %arg( 2 $tsfm ) ] )
      setvar a2 %recno_to_id( $ma $ACD!recno[ %arg( 3 $tsfm ) ] )
      SPLIT $a1 $a2 >$nulldev
      ;;
   case DELA)
      remove atom %recno_to_id( $ma $ACD!recno[ %arg( 2 $tsfm ) ] ) >$nulldev
      ;;






   case FILLV)
      fillvalence %recno_to_id( $ma $ACD!recno[ %arg( 2 $tsfm ) ] ) \
         %arg( 3 $tsfm ) 1 1.5 1 1.5 1 1.5 >$nulldev
      setvar ACD!recno[ %arg( 4 $tsfm ) ] %atom_info( $NEW_ATOM_ID RECNO )
      ;;
   case ADDAT)
      add atom %recno_to_id( $ma $ACD!recno[ %arg( 2 $tsfm ) ] ) \
         %arg( 3 $tsfm ) 1 1.5 >$nulldev
      setvar ACD!recno[ %arg( 4 $tsfm ) ] %atom_info( $NEW_ATOM_ID RECNO )
      ;;
   case MARKX)
      setvar nx %math( $nx + 1 )
      setvar aname %arg( 3 $tsfm )
      if %not( $aname )
         setvar aname %cat( X $nx )
      endif
      if %gt( %count( %atom_info( %recno_to_id( $ma \
         $ACD!recno[ %arg( 2 $tsfm ) ] ) ) ) 1 )
         echo WARNING: Multivalent attachment atom in %sln( $ma )
      endif
      modify atom name %recno_to_id( $ma $ACD!recno[ %arg( 2 $tsfm ) ] ) \
         $aname >$nulldev
      ;;

   case MAKEB)
      setvar a1 %recno_to_id( $ma $ACD!recno[ %arg( 2 $tsfm ) ] )
      setvar a2 %recno_to_id( $ma $ACD!recno[ %arg( 3 $tsfm ) ] )
      setvar bond %bonds( %cat( $a1 $a2 ) )
      if $bond
         switch %bond_info( $bond TYPE )
         case 1)
         case am)
            modify bond type $bond 2 >$nulldev
            ;;
         case 2)
            modify bond type $bond 3 >$nulldev
            ;;
         case )
            echo ACD_DO_RXN: $tsfm now has type: %bond_info( $bond TYPE )
            return
            ;;
         endswitch
      else
         add bond $a1 $a2 1 1.5 >$nulldev
      endif
      ;;
   case CLIP)
      # prune some atoms in recognition SLN
      # use remaining atoms in recognition SLN to control mapping of
      # reactant side chains to product side chains
      setvar lp %pos( "(" "$tsfm" )
      setvar rp %pos( ")" "$tsfm" )
      if %or( "%not( $lp )" "%not( $rp )" )
         echo Missing parentheses in CLIP command
         return
      endif
      setvar ats
      for at in %substr( "$tsfm" %math( $lp + 1 ) %math( $rp - $lp - 1 ) )
         setvar ats $ats %sln_rgroup_sybid( $ma $pat $at )
      endfor
      setvar rs
      for at in %substr( "$tsfm" %math( $rp + 1 ) )
         setvar rs $rs %atom_info( %arg( 1 %set_unpack( \
            %sln_rgroup_sybid( $ma $pat $at ) ) ) RECNO )
      endfor
      # following routine removes all atoms in $ats EXCEPT for those directly
      # attached to atoms NOT removed. The latter will be labelled X1 if $rs is empty,
      # otherwise $rs is to contain RECNO's (invariant after deletions) for all
      # attached atoms NOT removed.
      %chom_rmv_ats( %cat( $ma "(" %set_create( $ats ) ")" ) $rs )
      ;;
   case )
      echo ACD_DO_RXN: Unknown HOW operator: $tsfm
      return
      ;;
   endswitch
endfor

%return( $nx )
return

@macro FIX_FUSE ACD
#
# specific callback for aligning topomer confs of tryptanthrin variants
# ensure that NH=CH-CH2-C bond is 180 degrees and CH-CH2-C:C
# is 0 before FIT is done regardless of what Concord did to it.
localvar a
setvar a %set_unpack( $2 )
modify torsion %arg( 1 $a ) %arg( 3 $a ) \
   %arg( 5 $a ) %arg( 8 $a ) 180 >$nulldev
modify torsion %arg( 3 $a ) %arg( 5 $a ) \
   %arg( 8 $a ) %arg( 10 $a ) 0 >$nulldev
#.

@macro AMID_TORS ACD
# default callback, ensures that NH=CHCH2Any torsion is set to 180
# (minimization will change it) so that MATCH can work
localvar a
setvar a %set_unpack( $2 )
modify torsion %arg( 1 $a ) %arg( 3 $a ) \
   %arg( 5 $a ) %arg( 8 $a ) 180 >$nulldev
#.

@expression_generator ACD_calc_prop
#
# calculates physical properties of a previously unknown side chain
# logP, MW, topomer field (via call to CHOM!This_Build_3D for conformer)
# uses RSCRATCH as workspace MSS
#
globalvar ACD!CycFrag
localvar split_atms buildhow

TABLE DEFAULT Rscratch
TABLE CONFORMER SLN

# set up NULL values so we can tell if calculation failed
echo %wcell( 1 CLOGP 99.99 ) >$nulldev
echo %wcell( 1 MW -1.0 ) >$nulldev
echo %wcell( 1 SLN H<NAME="UNNAMED";COORD3D=(0.000,0.000,0.000)> ) >$nulldev
TABLE EVAL ALL 1 TOPOMERIC

# molecular weight for frag as is
echo %wcell( 1 SLN $1 ) >$nulldev
TABLE EVAL ALL 1 MW >$nulldev

# LogP is for structure with H instead of open valence
# echo %sln_to_mol( M4 $1 ) >$nulldev
default M4 >$nulldev
setvar nat %mol_info( M4 NATOMS )

# fix bad S=O typing
if %search2d( $1 S=O NoDup 1 y )
   for pat in %search2d( $1 O=S=O NoDup 0 y )
      modify atom type %sln_rgroup_sybid( M4 $pat 2 ) \
         S.o2 1 1.5 >$nulldev
   endfor
   for pat in %search2d( $1 O=S[F]Any NoDup 0 y )
      modify atom type %sln_rgroup_sybid( M4 $pat 2 ) \
         S.o 1 1.5 >$nulldev
   endfor
endif

# following replaces (and greatly simplifies) code that is believed to be obsolete
FILLVALENCE * H 1 1.09 1 1.09 1 1.09 >$nulldev
if %not( %gt( %mol_info( M4 NATOMS ) $nat ) )
   echo ERROR: NO unfilled valences in new fragment $1
   return
endif

modify atom name $NEW_ATOM_ID X1 >$nulldev
echo %wcell( 1 SLN %sln( M4 ) ) >$nulldev
TABLE EVAL ALL 1 CLOGP >$nulldev
# should check result here and go to simpler evaluation if CLOGP fails

# Add aligning group for Topomeric, to be found in $CHOM!Align[ MINIT ]
JOIN %cat( "M4(" %atoms( X1 ) ")" ) \
   %cat( $CHOM!Align[ MINIT ] "(6)" ) 1 1.54 >$nulldev



setvar cfa
setvar CHOM!Align[ ALICYC ] All_trans
setvar buildhow CONCORD
if $ACD!CycFrag
   setvar buildhow NOBUILD
   setvar CHOM!Align[ ALICYC ] None
endif

if %CHOM!THIS_BUILD_3D( M4 $buildhow $1 A )
   # remove aligning group before saving & doing CoMFA
   setvar pat %search2d( %sln( M4 ) $CHOM!Align[ SLN ] NoTriv 0 y )
   if $5
      # need to save recnos of standard split before doing custom split
      setvar split_atms %atom_info( \
         %sln_rgroup_sybid( M4 $pat 8 ) RECNO ) \
         %atom_info( %sln_rgroup_sybid( M4 $pat 5 ) RECNO )
      SPLIT %sln_rgroup_sybid( M4 $pat %set_unpack( $4 ) ) >$nulldev
      SPLIT %recno_to_id( M4 %arg( 1 $split_atms ) ) \
         %recno_to_id( M4 %arg( 2 $split_atms ) ) >$nulldev
   else
      SPLIT %sln_rgroup_sybid( M4 $pat 8 5 ) >$nulldev
   endif

   # evaluate and save CoMFA field
   setvar fsln %cat( %sln( M4 FULL ) )
   echo %wcell( 1 SLN $fsln ) >$nulldev
   %write( $3 $fsln ) >$nulldev
   TABLE CONF SLN
   TABLE ENTER CELL 1 TOPOMERIC NO NO >$nulldev
   TABLE EVAL ALL 1 TOPOMERIC >$nulldev
   setvar cfa %rcell( 1 TOPOMERIC )
else
   setvar cfa NULL
endif

# round up and return all the results
setvar logp %rcell( 1 CLOGP )
setvar mw %rcell( 1 MW )
if %not( %streql( "$cfa" NULL ) )
   if %eq( "$cfa" LOO )
      setvar cfa NULL
   else
      setvar cfa %comfa_hex( 1 TOPOMERIC )
   endif
endif

%return( "$logp $mw $cfa" )

#.

@expression_generator FIX_ACD
# ================================
#
# does string search/replace for groups - specifically nitro
globalvar ACD!SLNin ACD!SLNout
localvar ans p arg ct
setvar ans $*
setvar ct 1
for arg in $ACD!SLNin
   setvar p %pos( "$arg" "$ans" )
   while $p
      setvar ans %cat( %substr( "$ans" 1 %math( $p - 1 ) ) \
         %arg( $ct $ACD!SLNout ) %substr( "$ans" \
         %math( $p + %strlen( $arg ) ) ) )
      setvar p %pos( "$arg" "$ans" )
   endwhile
   setvar ct %math( $ct + 1 )
endfor
%return( "$ans" )

@macro cores sybylbasic
#
# Converts a hit list of core reactants into a hit list with cores, properties
# The side chains will be identical to those in some prototype rxn
localvar f buff files cct fct core_sln fcore weird xweird

setvar Xs
setvar Xlist
setvar xls
setvar weird Na K Ca
setvar fcore %cat( R $1 V $2 )
TABLE DEFAULT REACTIONS
setvar vars %tblsrch_val( REACTIONS CLASS_ID $1 )
setvar how_core %eq( 1 "%rcell( $vars MORE_CORES )" )
setvar coreflag NO
if $how_core
   setvar coreflag YES
endif

setvar fout %open( %cat( $fcore ".cores" ) "w" )
if %not( $ACD!NoCat )
   setvar rx %rcell( $vars CLASS_ID )
   setvar uname %rcell( $vars NAME )
   if %not( %eq( 1 %count( $uname ) ) )
      echo Not a one-word reaction NAME in row $vars : $uname
      return
   endif
   echo Preparing %cat( $fcore ".files" )
   TABLE DEFAULT REAGENTS
   setvar rcrows %set_unpack( $3 )
   for rg in $rcrows
      setvar x %rcell( $rg ATTACHED )
      setvar Xs %set_or( $x "$Xs" )
      setvar Xlist[ $x ] $Xlist[ $x ] $rg
   endfor

   setvar f %open( %cat( $fcore ".files" ) "w" )

   # following generates all combinations of all calls (no recursion in SPL)
   setvar npos %set_size( $Xs )
   setvar n2make 1
   for nx in %sort( %set_unpack( $Xs ) )
      setvar smax[ $nx ] %count( $XList[ $nx ] )
      setvar n2make %math( $n2make * $smax[ $nx ] )
      setvar idx[ $nx ] 1
   endfor

   for i in %range( 0 %math( $n2make - 1 ) )
      setvar idx %cat( R $nc ) $coreflag \
         $uname %cat( $fcore .cores )
      setvar base $i
      # establish indexes at each position
      for j in %set_unpack( $Xs )
         setvar rg %arg( %math( ( $base % $smax[ $j ] ) + 1 ) \
            $XList[ $j ] )
         setvar rf %rcell( $rg SAME_AS )
         if %and( "$rf" "%not( %streql( "$rf" "?" ) )" )
            setvar idx $idx $rf
         else
            setvar idx $idx %cat( $fcore R %rcell( $rg ID ) \
               "." %rcell( $rg ATTACHED ) )
         endif
         setvar base %math( $base / $smax[ $j ] )




      endfor
      %write( $f $idx ) >$nulldev
   endfor
   %close( $f )
endif



# now recover additional cores, if needed
if $how_core
   setvar cvars %tblsrch_val( CORES CLASS_ID \
      %rcell( $vars CLASS_ID ) )
   setvar how_core %rcell( $cvars HOW_CORE )
   setvar valences %rcell( $cvars VALENCES )
   setvar xls
   for ats in %range( 1 $valences )
      setvar xls $xls %cat( X $ats )
   endfor

   setvar core_sln %rcell( $cvars MORE_CORE )
   if %not( $core_sln )
      echo No MORE_CORE for reaction $vars
      return
   endif
   setvar xrlist %rcell( $cvars XRLIST )
   if %not( %eq( %count( $xrlist ) $VALENCES ) )
      echo mismatch between VALENCES and XRLIST for reaction $vars
      return
   endif
   setvar opat %string_insert( %string_insert( %rcell( $cvars XRCORE ) \
      %arg( 1 $xls ) %arg( 1 $weird ) ) \
      %arg( 2 $xls ) %arg( 2 $weird ) )
   setvar xrcore %rcell( $cvars XRCORE )
   core_get_acd $1 $2 $3

   setvar fhits %cat( $fcore core .hits )
   # start processing hits
   if %not( %file_exists( $fhits ) )
      echo $fhits (hitlist of core reactants) not found
   endif
   setvar cct 1
   TABLE CREATE hits unity M5 FROM_A_FILE "$fhits" | >$nulldev
   if $ACD!Price
      table column_append rdbms tcd_price first price
      table column_append rdbms tcd_suppliers first supplier
      table eval new * PRICE,SUPPLIER
   endif
   %CRC_NOT_UNIQUE( junk junk ) >$nulldev
   setvar choices %table( * ROW NUM )
else
   setvar choices %arg( 1 %rcell( $vars CORE_SLN ) )
endif



20 



for h in $choices
   if $how_core
      table default HITS
      setvar allsln %sln_get_sln_from_table( HITS $h )
   else
      setvar allsln $h
   endif

   # cycle through RELEVANT molecular component
   setvar p %pos( $allsln )
   while $p
      setvar allsln %substr( "$allsln" 1 %math( $p - 1 ) ) \
         %substr( "$allsln" %math( $p + 1 ) )
      setvar p %pos( "$allsln" )
   endwhile
   for cpsln in $allsln
      setvar cpsln %fix_acd( $cpsln )
      if $how_core
         setvar pat %search2d( $cpsln $core_sln NoDup 1 y )
         if %not( $pat )
            break
         endif
         setvar crc %sln_to_crc( $cpsln )
         if %CRC_NOT_UNIQUE( $crc )
            echo Skipping duplicate $cpsln
            break
         endif
         if %pos( "[I=" "$cpsln" )
            echo Isotope skipping $cpsln
            break
         endif
         echo Core $cct -- $cpsln
         %sln_to_mol( M1 $cpsln ) >$nulldev
         if %not( %acd_do_rxn( m1 $core_sln $how_core ) )
            goto nxt_core
         endif
         setvar outsln %sln_labelx( m1 $xls )

         # build XRLIST
         setvar osln %string_insert( %string_insert( \
            $outsln %arg( 1 $xls ) \
            %arg( 1 $weird ) ) %arg( 2 $xls ) %arg( 2 $weird ) )
         setvar patx %search2d( $osln $opat NoDup 1 y )
         if %not( $patx )
            echo $opat not found in $osln -- skipping core
            goto nxt_core
         endif
         setvar xrl
         for x in $xrlist
            setvar x %set_unpack( $x )
            if $xrl
               setvar xrl %cat( $xrl )
            endif
            setvar xrl %cat( $xrl %SLN_ID( $patx %arg( 2 $x ) ) \
               %arg( 1 $x ) "=" %SLN_ID( $patx %arg( 3 $x ) ) )
         endfor

         # is core symmetric?
         setvar sym 0
         %sln_to_mol( M2 $osln ) >$nulldev
         %sln_to_mol( M3 %string_insert( %string_insert( \
            $outsln %arg( 1 $xls ) \
            %arg( 2 $weird ) ) %arg( 2 $xls ) \
            %arg( 1 $weird ) ) ) >$nulldev
         if %streql( %sln( M2 UNIQUE ) %sln( M3 UNIQUE ) )
            setvar sym 1
         endif
      else
         setvar outsln $cpsln
         setvar sym 0
         setvar xrl %arg( 2 %rcell( $vars CORE_SLN ) )
      endif

      ###
      ### At this point $outsln is the SLN with X1, X2, etc for the
      ### variation sites.
      ###

      ### Calculate number of rotatable bonds WITHOUT X1, X2 attachment
      ### points.
      setvar newsln1 $outsln
      setvar ct 1
      setvar offset 2
      setvar p1 %pos( %cat( X $ct ) $newsln1 )
      while $p1
         setvar newsln1 %cat( %substr( $newsln1 1 %math( $p1 - 1 ) ) \
            %substr( $newsln1 %math( $p1 + $offset ) ) )
         setvar ct %math( $ct + 1 )
         if %eq( $ct 10 )
            setvar offset 3
         endif
         setvar p1 %pos( %cat( X $ct ) $newsln1 )
      endwhile

      setvar scratch_molarea %molempty()
      %sln_to_mol( $scratch_molarea $newsln1 ) >$nulldev
      setvar old_default $default_area
      default $scratch_molarea >$nulldev
      setvar bds %set_create( %bonds( (*-{RINGS()})&<1> ) )
      setvar mval %set_create( \
         %atoms( \
         <H>+<o.2>+<F>+<I>+<Cl>+<Br>+<n.1>+<LP>+<Du> ) )
      setvar pds %set_create( %bonds( %cat( "{TO_ATOMS(" $mval ")}" ) ) )
      setvar bds %set_diff( $bds $pds )
      if $bds
         setvar bds %set_size( $bds )
      else
         setvar bds 0
      endif
      zap $scratch_molarea
      default $old_default >$nulldev




      ###
      ### $outsln can also be sent to acd_core_props1 to generate MW and CLOGP
      ###
      setvar props %ACD_Core_Props1( $outsln )

      ###
      ### Change all X into Y_0
      ###
      setvar ct 1
      setvar ypfx Y_0
      while TRUE
         if %pos( %cat( X $ct ) $outsln )
            setvar outsln %string_insert( \
               $outsln %cat( X $ct ) %cat( $ypfx $ct ) )
         else
            break
         endif
         setvar ct %math( $ct + 1 )
         if %eq( $ct 10 )
            setvar ypfx Y_
         endif
      endwhile

      if $how_core
         TABLE DEFAULT HITS
         setvar sln %cat( $outsln "<FCD=" %table( $h ROW NAME ) \
            ";PRICE=" %rcell( $h PRICE ) ";SUPPLIER=" %uppercase( \
            %ACD_Get_Preferred_Supplier( %rcell( $h SUPPLIER ) ) ) \
            ";MW=" %arg( 1 $props ) ";RBD=" $bds ";LOGP=" \
            %arg( 2 $props ) ";SYM=" $sym ";XRLIST=" $xrl ">" )
      else
         setvar sln %cat( $outsln "<MW=" %arg( 1 $props ) ";RBD=" $bds \
            ";LOGP=" %arg( 2 $props ) ";SYM=" $sym ";XRLIST=" \
            $xrl ">" )
      endif

      %write( $fout $sln ) >$nulldev
      if $ACD!Test
         goto alldone
      endif
      nxt_core:
      setvar cct %math( $cct + 1 )
      break
   endfor
endfor

alldone:
%close( $fout ) >$nulldev
if $how_core
   TABLE CLOSE hits NO >$nulldev
   ACD!Record CORES %cat( $fcore ".cores" ) $cvars VARIANTS UPDATED
endif

#.
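The `.files` loop in the `cores` macro enumerates every combination of side-chain rows without recursion by treating the loop counter as a mixed-radix number: the remainder after dividing by the row count at a site selects the row, and the integer quotient is carried to the next site. A minimal C sketch of that decoding is given below. The function names are hypothetical, and the choices are 0-based here, whereas the macro adds 1 because `%arg` positions start at 1.

```c
#include <stddef.h>

/* Decode combination number i into one choice per site. smax[j] is the
 * number of alternatives at site j (the macro's $smax array). */
void decode_combination(unsigned long i, const unsigned long *smax,
                        size_t n_sites, unsigned long *choice)
{
    unsigned long base = i;
    for (size_t j = 0; j < n_sites; j++) {
        choice[j] = base % smax[j];   /* digit for this site */
        base /= smax[j];              /* carry the rest to the next site */
    }
}

/* Total number of combinations: the product of the per-site counts
 * (the macro's $n2make). */
unsigned long count_combinations(const unsigned long *smax, size_t n_sites)
{
    unsigned long n = 1;
    for (size_t j = 0; j < n_sites; j++)
        n *= smax[j];
    return n;
}
```

Calling `decode_combination` for every i from 0 to `count_combinations(...) - 1` visits each combination exactly once, which is what the SPL loop relies on.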

@expression_generator string_insert
#
setvar p %pos( $2 $1 )
if $p
   if $3
      setvar ans %cat( "%substr( $1 1 %math( $p - 1 ) )" \
         $3 "%substr( $1 %math( $p + %strlen( $2 ) ) )" )
      %return( $ans )
   else
      setvar ans %cat( "%substr( $1 1 %math( $p - 1 ) )" \
         "%substr( $1 %math( $p + %strlen( $2 ) ) )" )
      %return( $ans )
   endif
else
   %return( $1 )
endif
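Despite its name, `string_insert` replaces the first occurrence of its second argument within the first argument by the third, or deletes that occurrence when no third argument is given. An equivalent C routine might look like the sketch below. It is an illustration under assumptions rather than the authors' code: the C function reuses the macro's name only for readability, and the caller owns the returned buffer.

```c
#include <stdlib.h>
#include <string.h>

/* Replace the first occurrence of `find` in `src` with `repl`
 * (`repl` may be NULL or "" to delete the match). Returns a newly
 * allocated string that the caller must free, or NULL if malloc fails.
 * When `find` is absent or empty, an unchanged copy of `src` is returned. */
char *string_insert(const char *src, const char *find, const char *repl)
{
    const char *hit = (*find != '\0') ? strstr(src, find) : NULL;
    size_t src_len = strlen(src);

    if (repl == NULL)
        repl = "";
    if (hit == NULL) {
        char *copy = malloc(src_len + 1);
        if (copy != NULL)
            memcpy(copy, src, src_len + 1);
        return copy;
    }

    size_t head = (size_t)(hit - src);      /* characters before the match */
    size_t find_len = strlen(find);
    size_t repl_len = strlen(repl);
    char *out = malloc(src_len - find_len + repl_len + 1);
    if (out == NULL)
        return NULL;
    memcpy(out, src, head);
    memcpy(out + head, repl, repl_len);
    /* copy the tail after the match, including the terminating NUL */
    memcpy(out + head + repl_len, hit + find_len, src_len - head - find_len + 1);
    return out;
}
```

The `cores` macro uses this operation to swap the attachment labels X1 and X2 for placeholder atoms (Na, K) before a 2D search, and to rename Xn to Y_0n in the output SLN.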

@expression_generator ACD_extract_ridX
# ================================
#
# backs out the row id and X from the input file name

# get rid of first few chars
setvar arg %substr( $1 4 )
setvar r %pos( R $arg )
setvar p %pos( $arg )
%return( %set_create( %substr( $arg %math( $r + 1 ) %math( $p - $r - 1 ) ) \
   %substr( $arg %math( $p + 1 ) ) ) )

#.

@macro core_get_acd sybylbasic
#
# do reagent searches in ACD for all specified rows in reagents
localvar fct rg sfrag buff bf hfname

setvar rg %tblsrch_val( CORES CLASS_ID $1 )
setvar sfrag %rcell( $rg MORE_CORE )
setvar hfname %cat( R $1 V $2 core )




if %or( "$ACD!DoSearch" "%not( %file_exists( %cat( $hfname .hits ) ) )" )
   # prepare notlist file
   setvar notf %open( %cat( $hfname .bad ) "w" )
   for not in %rcell( $rg CORE_NOTLIST )
      if %file_exists( $not )
         # write out all bad fragments NOT CONTAINED by SEARCH FRAGMENT
         setvar bf %open( $not "r" )
         while %not( %eof( $bf ) )
            setvar buff %read( $bf )
            if %and( "%not( %eof( $bf ) )" "%not( %streql( \
               "%substr( "$buff" 1 1 )" ) )" )
               if %not( %search2d( $sfrag "$buff" NoTriv 0 y ) )
                  %write( $notf $buff ) >$nulldev
               else
                  echo Not excluding $not \
                     fragment $buff (contained in $sfrag )
               endif
            endif
         endwhile
         %close( $bf )
      else
         %write( $notf $not ) >$nulldev
      endif
   endfor
   %close( $notf )

   # prepare query file
   setvar notf %open( %cat( $hfname .query ) "w" )
   %write( $notf $sfrag ) >$nulldev
   %close( $notf )

   # do search (first time for individual components,
   # second time to filter multicomponent cpds retrieved)
   echo Searching for $sfrag
   setvar dbs dcl $ACD!cmd -database $ACD!db -qfile \
      %cat( $hfname .query ) -notlist %cat( $hfname .bad ) \
      -hitlist tmp.hits -coords
   if $ACD!Test
      setvar dbs $dbs -maxhits 10
   endif
   $dbs
   setvar dbs dcl $ACD!cmd -database tmp.hits -dbtype sln -qfile \
      %cat( $hfname .query ) -notlist %cat( $hfname .bad ) \
      -hitlist %cat( $hfname .hits )
   $dbs
endif

#.

Appendix "F"

/*E+:SYB_MGEN_GPLS_COMFA_HEX */
/*
 * int SYB_MGEN_GPLS_COMFA_HEX( identifier, nargs, args, writer )
 *
 * Expression generator that returns hex version of a fingerprint
 * interface:
 *
 *    %comfa_hex( Row CoMFA_col )
 *       with Row being a row to dump
 *       CoMFA_col being a column selection for the topomer fingerprint
 *       handles steric field or, if 3 args, electrostatic
 *       converts fpt to 4 bits
 */

int SYB_MGEN_GPLS_COMFA_HEX( identifier, nargs, args, writer )
char *identifier;
int nargs;
char *args[];
PFI writer;
{
   int row, type, present;
   int err, i;
   set_ptr ref;
   ROWCOL_SEL_PTR row_sel;
   char *dum, *cname, *parname, *table;
   FieldPtr ofield;
   ComfaMolPtr cmp;

   if (! LM_ACCESS_CHECK_CmpdSel("CmpdSel","CmpdSel") )
   { UBS_OUTPUT_MESSAGE(stdout,"This requires a license to CmpdSel.\n");
     return 0; }
   if (nargs < 2 || nargs > 3 )
   {
      UIMS2_WRITE_ERROR(
         "Error: %comfa_hex (Row PrintCol (field 2 ) )\n" );
      return 0;
   }

   /* get the column */
   if (!(table=TSH_APLINT_GET_DEFAULT_TABLE() ) ) goto badcol;
   if (!(UIMS2_VARTYPE_CALC_VALUE("COL_SEL",args[1], &row_sel)) ||
       !TBL_ACCESS_INDEX_TO_COLNAME( table, row_sel->id -1, &cname ) ||
       !TBL_ATTR_SAMPLE_COLUMN_A(table, cname, "FIELD", &dum, &present)
       || !present)
   { UBS_OUTPUT_MESSAGE(stdout,"Not a valid CoMFA column.\n");
     goto badcol; }

   /* get the reference row */
   if (!(UIMS2_VARTYPE_CALC_VALUE("ROW_SEL",args[0], &row_sel)) ||
       !TBL_ACCESS_X_GET_VALUE(table, row_sel->id -1, cname,
           "CELL_SUPPORT", (int *)&cmp, &err ) )
   {
      UIMS2_WRITE_ERROR(
         "Error: Invalid reference row selection for %fp_hex\n" );
      return 0;
   }

   if (!cmp || !(ofield = (nargs == 3) ? cmp->end_p : cmp->sfld_p) ) {
      /* the data is not there */
      UBS_OUTPUT_MESSAGE(stdout,"Not a valid CoMFA cell.\n");
      goto badcol; }

   dum = UIMS2_MessageBuffer;
   for (i=0; i<ofield->n_points; i++, dum += 1 )
      sprintf(dum, "%.1x", lookup_my_comfa_code(ofield->field_value[i]) );
   (*writer)( UIMS2_MessageBuffer );
   return 1;

badcol:
   UIMS2_WRITE_ERROR(
      "Error: Invalid column selection for %comfa_hex\n" );
   return 0;
}

int lookup_my_comfa_code(value)
fpt value;
{
   static fpt cutoff[16] = {9999., 0., 2., 4., 6., 8., 10., 12.,
                            14., 16., 18., 20., 22., 24., 26., 30.
   };
   int i;

   if (!DABS_DUT_OKDATA(value)) return 0;
   for (i=1; i<16; i++) if (value <= cutoff[i]) return i;
   UBS_OUTPUT_MESSAGE(stdout,"Invalid field value above 30.0 set to missing.\n");
   return 0;
}
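The two routines above quantize each CoMFA field value into one of fifteen bins and emit a single hexadecimal digit per lattice point, with 0 reserved for missing data. The standalone sketch below reproduces that encoding using only the standard C library. The bin boundaries are copied from the listing; the names `comfa_code` and `comfa_hex` are illustrative, and the vendor's missing-data test (`DABS_DUT_OKDATA`) is deliberately left out because its behavior is not shown in this document.

```c
#include <stdio.h>
#include <stddef.h>

/* Upper bin boundaries from the listing; index 0 is never compared. */
static const double cutoff[16] = {9999., 0., 2., 4., 6., 8., 10., 12.,
                                  14., 16., 18., 20., 22., 24., 26., 30.};

/* Map one field value to a 4-bit code 1..15; 0 means "missing". */
int comfa_code(double value)
{
    for (int i = 1; i < 16; i++)
        if (value <= cutoff[i])
            return i;
    return 0;   /* above 30.0: treated as missing, as in the listing */
}

/* Write one hex digit per field value into `out`, which must hold at
 * least n + 1 bytes. */
void comfa_hex(const double *field, size_t n, char *out)
{
    for (size_t i = 0; i < n; i++)
        sprintf(out + i, "%x", (unsigned)comfa_code(field[i]));
    out[n] = '\0';
}
```

Because every lattice point becomes exactly one character, two encoded fields of the same lattice can be compared position by position as plain strings.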

/*E+:SYB^MGEN_GPLS_FP.HEX */ 

* * 

* int SYB_MGEN^GPLS_FP_HEX( identifier, nargs, args, writer ) * 
25 ♦ * 

* Expression generator that returns hex version of a fingerprint * 

* interface: * 

* * 

30 * %fp_hex(Row (Finger_col) * 

* with Row being a row to dump * 



wo 97/27559 PCTAJS97/01491 

263 

* Finger_col being a column selection for the fingerprint * 

* * 

int SYB_MGEN_GPI5_FP_HEXCi<lentifiCT, naigs, args, writer) 
5 char ^^idoitifioi 
int nargs; 
char *aiEsn; 
PFI writer; 
{
    int row, type, present;
    int err, i;
    set_ptr ref;
    ROWCOL_SEL_PTR row_sel;
    char *dum, *cname, *parname, *table;

    if (! IJ^_ACCESS_CHECK_CmpdSel("CmpdSel","CmpdSel") )
    { UBS_OUTPUT_MESSAGE(stdout,"This requires a license to CmpdSel.\n");
      return 0; }
    if(nargs!=2)
    {
        UIMS2_WRITE_ERROR(
            "Error: %fp_hex (Row PrintCol )\n" );
        return 0;
    }

    /* get the column */
    if (!(table=TSH_APLrNT_GET_DEFAULT_TABLE() ) ) goto badcol;
    if (!(UIMS2_VARTYPE_CALC_VALUE("COL_SEL",args[1], &row_sel)) ||
        !TBL_ACCESS_INDEX_TO_COLNAME( table , row_sel->id -1,
                                      &cname))
        goto badcol;
    if (! TBL_UTL_COLTO_FUNCTION(table, cname, &parname))
        goto badcol;
    if (!TBL_ATTR_FIND_COLUMN_A ( table, parname,
                                  "TYPE", &dum, &type ))
        goto badcol;
    type = TBL_IO_TYPE_TO_KEY( type );
    if ( type != PROC_V_PRINT &&
         !(TBL_ATTR_SAMPLE_COLUMN_A(table, cname, "FINGERPRINT",
                                    &dum, &present) && present ) )
        goto badcol;

    /* get the reference row */
    if (!(UIMS2_VARTYPE_CALC_VALUE("ROW_SEL",args[0], &row_sel)) ||
        !TBL_ACCESS_X_GET_VALUE(table, row_sel->id -1, cname,
                                "CELL_SUPPORT", (int *)&ref, &err ) ||
        !ref )
    {
        UIMS2_WRITE_ERROR(
            "Error: Invalid reference row selection for %fp_hex\n" );
        return 0;
    }

    dum = UIMS2_MessageBuffer;
    err = (ref[0]+31) / 32;
    for (i=1;i<=err;i++, dum +=8 )
        sprintf(dum, "%.8x", ref[i] );
    (*writer)( UIMS2_MessageBuffer );
    return 1;
badcol:
    UIMS2_WRITE_ERROR(
        "Error: Invalid column selection for %fp_hex\n" );
    return 0;
}
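
The hex dump performed by the expression generator above can be illustrated in isolation. This is a minimal sketch that assumes a plain array of 32-bit words with the bit count passed separately; in the listing the bit count is stored in the first word of the set. The function name `bitset_to_hex` is hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Write the first nbits bits of a word-packed bitset as 8 hex digits per
 * 32-bit word. `out` must hold at least 8 * ((nbits + 31) / 32) + 1 bytes.
 * Returns the number of words written. */
int bitset_to_hex(const unsigned int *words, int nbits, char *out)
{
    int nwords, i;
    assert(words != NULL && out != NULL && nbits >= 0);
    nwords = (nbits + 31) / 32;        /* round up to whole words */
    out[0] = '\0';
    for (i = 0; i < nwords; i++, out += 8)
        sprintf(out, "%.8x", words[i] & 0xffffffffu);
    return nwords;
}
```

Rounding the bit count up to whole words means a 33-bit fingerprint is written as two words, that is, 16 hex characters.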

Appendix "G"

/* power
 */

/* David E. Patterson
 */

/* substantially changed 6/96 for cores-based reorganization of operation
 * updated to include more reaction info (Dick Cramer - 10/24/96)
 * updated to use DB_CT_CCT_GET_PRD routines (10/29/96 DEP)
 *
 * This program performs the following functions:
 *   (1) read in one line from a ".files" file, one line per cores/X1/X2 file
 *   (2) read in one core to process (core / X1 / X2 file) == a cSLN
 *   (3) for each cSLN, open a fp file to contain fingerprints
 *       (a) first is fingerprint size in bits
 *       (b) 2nd is number of records in segment (header + core + n1 + n2)
 *       (c) 3rd record notes size of record in bytes
 *       (d) 4th is number of cSLN segments included ( == 1 here always)
 *       (e) 5th and following ints contain the ASCII .2DRULES filename
 *       (A) next (second) record represents an "augmented fingerprint"
 *           which is made by attaching invariant pieces of X1 and X2 to core
 *           -> cardinality plus bitset is the record for every fp <-
 *       (B) then N1 + N2 augmented fingerprint records for all of the
 *           structural variations
 *   (4) compute MBITS and LBITS estimates of worst case missing bits
 *   (5) write out a "master record" entry for the result
 *
 * power -file <name> -line <m> -core <n> -fraction <f> -screendef <file>
 *       -prefix <file> +debug
 *
 * Options:
 *
 *   -file name      - name is file with names of cores/X1/X2 that
 *                     determines what gets built
 *
 *   -line number    - which line in file to process
 *   -core number    - which core in corefile named in line to process
 *
 *   -fraction f     - fraction of products to be evaluated, 0.0 - 1.0
 *                     or if more than 1.0, it is the NUMBER desired
 *                     and an appropriate fraction is computed to yield
 *                     approximately this number
 *
 *   -screendef file - name of a file containing the fingerprint
 *                     definition rules.
 *
 *   -prefix file    - name from which output filenames will be formed
 *                     (i.e. -prefix Hi -> Hi.fp and Hi.mr)
 *
 *   +debug          - writes irrelevant info to stderr
 *                     This flag forces the display of all options
 */
/* use 3db
 * dbcc power.c -o power */
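
The header record of the fingerprint file described in item (3) of the program comment can be sketched as follows. This is an illustration under stated assumptions, not the program's own routine: the function name `fill_fp_header` is hypothetical, and the record is assumed to be an array of `words` ints as in the listing.

```c
#include <assert.h>
#include <string.h>

/* Fill a header record of `words` ints:
 *   rec[0]   number of records in the segment (header + core + n1 + n2)
 *   rec[1]   size of one fingerprint record in bytes (cardinality + bitset)
 *   rec[2]   number of cSLN segments (always 1 here)
 *   rec[3..] the rules file name as ASCII, NUL terminated */
void fill_fp_header(int *rec, int words, int n1, int n2, const char *rules_name)
{
    size_t room;
    assert(rec != NULL && rules_name != NULL && words > 3);
    memset(rec, 0, (size_t)words * sizeof(int));
    rec[0] = 2 + n1 + n2;
    rec[1] = (int)(sizeof(int) * (size_t)(words + 1));
    rec[2] = 1;
    room = (size_t)(words - 3) * sizeof(int);
    /* Copy at most room - 1 bytes so the zeroed last byte stays a NUL. */
    strncpy((char *)&rec[3], rules_name, room - 1);
}
```

The single int holding the fingerprint size in bits is written ahead of this record, so a reader can size its buffers before parsing anything else.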
#include <stdio.h>
#include <signal.h>
#include <ctype.h>
#include <unistd.h>
#include <string.h>
#include <sys/stat.h>
#include <math.h>
#include "parseopt.h"
#include "utl_str.h"
#include "utl_mem.h"
#include "utl_file.h"
#include "utl_math.h"
#include "ct.h"
#include "ct_expr.h"
#include "ct_proto.h"
#include "import_proto.h"

#define GoodExit 0
#define ErrorExit 1

#define Visual(s) { fprintf s; }

static int    (*ExploderFunction)();

static char   *ScreenFileName;
static char   DefaultScreenFileName[32] = "standard.2DRULES";
static int    *ScreenStructure;
static int    **fingerPointer;
static int    *fingerPrint = 0;
static int    *fingerMask = 0;
static int    fingerBits;
static int    Mbits;
static int    Lbits;
static double Fraction = 1.0 ;
static int    TopNumber = 0;
static char   *FileOfFiles;
static char   *Corefile, *X1file, *X2file;
static char   *PrefixForFiles;
static char   *ReactionCode;
static char   *UserRxnName;
static char   DefaultPrefixForFiles[20] = "csln_preprocess";
static FILE   *InputSourceFile;
static FILE   *FileOfFilesFile;
static FILE   *fpFile;
static int    nbits[256];
static char   *fullQuery;
static char   **FGPT_X;
static char   *Xrlist;
static int    WordsPerFingerprint = 0;
static int    BytesPerFingerPrint = 0;
static int    CurrentSlnId = 0;
static int    DebugLevel;
static int    UserAborted;
static int    NullCore;
static int    MoreRxnInfo;
static int    StartCore = 0;
static int    LineFile = 0;
static char   *CombNameTemplate;
static int    CombCounter;
static int    **Y_01;  /* fingerprints */
static int    **Y_02;  /*      "       */
static int    nY_01;   /* number of structures */
static int    nY_02;   /*      "       */
static int    nProcessed = 0;

static void *fullCsln, *xcoreCsln, *temp1Csln, *temp2Csln;
static char *CoreSln;
static char *X1xfile, *X2xfile;
static struct ParseOptions Options[] = {
/***
*** DO NOT MOVE ENTRIES IN THIS TABLE. ADD ENTRIES ONLY AT THE END.
***/
    {"file", ParseOptString, &FileOfFiles,
        "File listing all input files" },
    {"fraction", ParseOptDouble, &Fraction,
        "Proportion of products(0 to 1) or Number to test" },
    {"screendef", ParseOptOldFile, &ScreenFileName,
        "File which defines the UNITY screen" },
    {"line", ParseOptInt, &LineFile,
        "Sequential entry to use in Files file" },
    {"core", ParseOptInt, &StartCore,
        "Sequential core to use in Cores file" },
    {"prefix", ParseOptString, &PrefixForFiles,
        "Filename root for output files" },
    {"debug", ParseOptBoolean, &DebugLevel,
        "Use +debug to enable debugging messages" },
};

int UBS_OUTPUT_MESSAGE() { return 0; } /* just for compiling OK */
int UIMS2_WRITE_PHOTO() { return 0; }

int lowercase (s) char *s; {while (*s) { if (isupper(*s)) *s = tolower(*s);
    s++;}}

static void UserHitControlC()
/*+I
 * This function is the signal handler for user initiated program
 * termination.
 * Its only role is to set a flag indicating that the user wishes to abort
 * the program.
 *
 * Author        Date      Description
 * G. B. Smith   02-09-93  Original Version
 *
 */
{
    UserAborted = 1;
}



static int ParseArguments( argc, argv )
/*+I
 *
 * This function parses the command line arguments.
 *
 * Returns: 1 on a successful command line parse, 0 otherwise.
 *
 * Warnings:
 *
 * Errors:
 *
 * See Also:
 *
 * Author        Date      Description
 * ===========   ========  ================
 * G. B. Smith   02-09-93  Original Version
 *
 */
int argc;
char **argv;
{
    int nargs,
        noptions = sizeof( Options )/sizeof(Options[0]);

    nargs = UTL_PARSE_OPT( argc, argv, noptions, Options );
    if( !nargs ) goto SyntaxError;
    if ( (!StartCore) || (!LineFile)) return 0;
    if (!PrefixForFiles) PrefixForFiles = DefaultPrefixForFiles;
    return 1;
SyntaxError:
    return 0;
}



int main( argc, argv )
/*+E
 *
 */
int argc;
char **argv;
{
    long startTime,
         totalTime,
         finishTime;
/*
*** Establish handler for a user interrupt.
*/
    signal( SIGINT, UserHitControlC);
#ifdef SIGHUP
    signal( SIGHUP, UserHitControlC);
#endif
    if( ! ParseArguments( argc, argv ) )
        goto SyntaxError;
    time( &startTime );
    Visual((stderr, "Begin reading csln : %s",ctime(&startTime)));
    /* Let's actually do something now */
    WarmUp();
    if(!(FileOfFilesFile = UTL_FILE_FOPEN(FileOfFiles,"r"))) return 0;
    GetFileSet (FileOfFilesFile); /* get cSLN info - core, X1, X2 */
    if (! FGPT_X[0] || ! FGPT_X[1] ) goto FailureExit;
    if (!*FGPT_X[0] || !*FGPT_X[1] ) goto FailureExit;
    if (!ReadTheCslnInfo()) goto FailureExit;
    time( &finishTime );
    Visual((stderr,"Begin computations: %s",ctime(&finishTime)));
    time( &finishTime );
    if (!UserAborted && !DoPiecewiseFingerprints()) goto FailureExit;
    totalTime = finishTime - startTime;
    if( ! totalTime ) totalTime = 1 ;
    Visual((stderr, "Created %d Finger Processed reagents in ",
        nProcessed = nY_01+nY_02 ));
    Visual((stderr,"%d Hours, %d min, %d secs\n",
        totalTime/(60*60),
        (totalTime%(60*60))/60,
        (totalTime%60)));
    Visual((stderr,"Each comparison required %.8f seconds to calculate\n",
        (totalTime/((double)(nProcessed?nProcessed: 1)))));
    time( &finishTime );
    Visual((stderr,"\nNow evaluating missing bits distribution at %s\n",
        ctime(&finishTime))) ;
    if (!UserAborted && !CheckMissingBits()) goto FailureExit;
    CoolDown();
    time( &finishTime );
    Visual((stderr,"End bits checking: %s",ctime(&finishTime)));
    Visual((stderr,"End cSLN preparation : %s",ctime(&finishTime)));
    UserAborted ? exit(ErrorExit) : exit(GoodExit);
SyntaxError:
    exit(1);
FailureExit:
    exit(ErrorExit);
}

int GetFileSet(f)
FILE *f;
{
    char *three_files, *hold, *pch;
    int i;

    /* does not read the core itself */
    for (i=0;i<LineFile;i++)
        if( -1 == UTL_SCAN_GETS(FileOfFilesFile, "\\","#",&three_files)) return 0;

    /* see how many tokens there are - if >5, new format with rxn data */
    for (i = 0, pch = three_files; *pch; pch++) if (*pch == ' ') i++;
    if ((MoreRxnInfo = i>4) ) {
        for (pch = three_files; *pch != ' '; pch++); *pch++ = '\0';
        if(!(ReactionCode = UTL_STR_SAVE( three_files ) )) return 0;
        for (hold = pch ; *pch != ' '; pch++); *pch++ = '\0';
        NullCore = (int) strstr( "YES", hold );
        for (hold = pch; *pch != ' '; pch++); *pch++ = '\0';
        if (!(UserRxnName = UTL_STR_SAVE( hold ) )) return 0;
    }
    else pch = three_files;

    for (Corefile = pch; *Corefile == ' '; Corefile++) ;
    for (X1file = Corefile ; *X1file != ' '; X1file++) ;
    *X1file++ = '\0';
    for ( ; *X1file == ' '; X1file++) ;
    for (X2file = X1file ; *X2file != ' '; X2file++) ;
    *X2file++ = '\0';
    for( ; *X2file == ' '; X2file++) ;
    Corefile = UTL_STR_SAVE(Corefile);
    X1file = UTL_STR_SAVE( X1file);
    X2file = UTL_STR_SAVE( X2file);

    hold = 0;
    nY_01 = testread(X1file,hold,1);
    nY_02 = testread(X2file,hold,2);
    return 1;
}

/* free up the arrays in the loop */
int CoolDown()
{
    char *hold;
    int i;

    for (i=0;i < nY_01;i++) UTL_MEM_FREE(Y_01[i]);
    UTL_MEM_FREE(Y_01);
    for (i=0;i < nY_02;i++) UTL_MEM_FREE(Y_02[i]);
    UTL_MEM_FREE(Y_02);
    UTL_FILE_DELETE(X1xfile);
    UTL_FILE_DELETE(X2xfile);
    UTL_MEM_FREE(Corefile);
    UTL_MEM_FREE( X1file);
    UTL_MEM_FREE( X2file);
    return 1;
}

int WarmUp()
{
    int i;
    FILE *fp;

    for(i=0;i<256;i++) nbits[i] = (i&1) + (i&2)/2 + (i&4)/4 + (i&8)/8 +
                                  (i&16)/16 + (i&32)/32 + (i&64)/64 +
                                  (i&128)/128 ;
    if (!ScreenFileName) ScreenFileName = DefaultScreenFileName;
    if (!(fp = UTL_FILE_FOPEN(ScreenFileName,"r"))) return 0;
    ScreenStructure = (int *) DB_BIT2_PARSE_2DSCREEN(fp);
    UTL_FILE_FCLOSE(fp); fp = 0;
    if (!ScreenStructure) return 0;

    BytesPerFingerPrint = DB_BIT2_GET_SIZE( ScreenStructure );
    WordsPerFingerprint = (BytesPerFingerPrint + 3) / 4;
    fingerPrint = (int *) UTL_MEM_ALLOC( BytesPerFingerPrint);
    fingerMask = (int *) UTL_MEM_ALLOC( BytesPerFingerPrint);
    if (Fraction > 1.0) TopNumber = Fraction;

    Get_BY_SLN_Mask(); /* Set up for LBITS by ignoring the counts */

    FGPT_X = (char**) UTL_MEM_ALLOC( sizeof(char *) * 2 );
    return 1;
}

int Get_BY_SLN_Mask()
{
    /* placeholder until a general one is written.
       This is correct for standard.2DRULES as of 6/96 */
    int i;
    unsigned char *foo;

    foo = (unsigned char *) fingerMask;
    for (i=0;i<116;i++) *foo++ = 0xFF;
    for (i=116;i<124;i++) *foo++ = 0;
    return 1;
}

char *GenerateMySln(core)
char *core;
{
    /* ??? CONVERT THE Y_Ox to Xn in core ??? */
    char *foo, *oof, *goo;
    goo = UTL_STR_SAVE(core);
    foo = strstr(goo,"Y_01");
    foo[1]=foo[2]=' '; foo[0]='X';
    oof = strstr(goo,"Y_02");
    oof[1]=oof[2]=' '; oof[0]='X';
    for (oof=foo=goo; *oof; oof++)
        if (*oof != ' ') *foo++ = *oof;
    *foo = '\0';
    return goo;
}

/* This routine should open the fp output file
   generate the full cSLN
   generate the augmented core SLN
   write header and augmented core fp to fp file
   generate *.rgroup files for later fp work. */
int ReadTheCslnInfo()
{
    int i;
    char *junk, *hold, *line, *one, *two, *thr, *fou, *fiv, *six ,*l^ion;
    char *my_concatenate(), *augment();
    char *my_how_youve_grown();
    FILE *tfil;

    if (! (InputSourceFile = fopen(Corefile,"r"))) return 0;
    for (i=0;i<StartCore;i++)
        if (-1 == UTL_SCAN_GETS( InputSourceFile, "\\", &line)) return 0;
    fclose(InputSourceFile) ;
    if (!GrabXrlist(line)) return 0;
    one = strstr(line,"<");
    *one = '\0'; /* zap the parameters at the end of the line */
    CoreSln = GenerateMySln(line);

    if (!(hold = UTL_STR_CONCATENATE(PrefixForFiles,".fp"))) return 0;
    if (! (fpFile = fopen(hold,"w"))) return 0;
    UTL_MEM_FREE(hold);
    i = BytesPerFingerPrint * 8 ;
    UTL_FILE_FWRITE( &i ,sizeof(int), 1 ,fpFile);
    fingerPrint[0] = 2 + nY_01 + nY_02;
    fingerPrint[1] = sizeof(int)*(WordsPerFingerprint + 1);
    fingerPrint[2] = 1;
    junk = ScreenFileName;
    hold = (char *) &(fingerPrint[3]) ;
    for (i=0; i < (WordsPerFingerprint-3)*sizeof(int); junk++)
    { *hold++ = *junk;
      if (! *junk) break; }
    UTL_FILE_FWRITE(fingerPrint,sizeof(int),WordsPerFingerprint,fpFile);

    if (!(X1xfile = UTL_STR_CONCATENATE(PrefixForFiles,".FGPT.1"))) return 0;
    tfil = fopen(X1xfile,"w");
    fprintf(tfil,"%s\n",FGPT_X[0]);
    fclose(tfil);

    if (!(X2xfile = UTL_STR_CONCATENATE(PrefixForFiles,".FGPT.2"))) return 0;
    tfil = fopen(X2xfile,"w");
    fprintf(tfil,"%s\n",FGPT_X[1]);
    fclose(tfil);

    if (!sln_defines_csln( &xcoreCsln, X1xfile, X2xfile)) return 0;
    if (!sln_defines_csln( &temp1Csln, X1file , X2xfile)) return 0;
    if (!sln_defines_csln( &temp2Csln, X1xfile, X2file)) return 0;
    if (!sln_defines_csln( &fullCsln, X1file, X2file)) return 0;
    return 1;
}

int GrabXrlist(string)
char *string;
{
    /* find XRLIST= and grab what's in there ! */
    char *foo, *strip_down();
    if (!(string = strstr(string,"XRLIST="))) return 0;
    Xrlist = strip_down(string);
    return 1;
}

int testread(old, new, which)
char *old, *new;
int which;
{
    FILE *file, *elif;
    int i;
    char *line;
    char *strip_down();

    /* get and hold FGPT_X info here
       Expect it to be at top of file preceded by a # */
    if (!(file = fopen(old,"r"))) return 0;
    if (new && !(elif = fopen(new,"w"))) return 0;
    which--;
    FGPT_X[which] = 0;
    while (!FGPT_X[which])
    { if (-1 == UTL_SCAN_GETS( file, "\\", &line)) return 0;
      if ( line = strstr(line,"FGPT_X=") ) FGPT_X[which] = strip_down(line);
    }

    /* this won't really work if the attachment point is NOT the first atom
       listed */
    FGPT_X[which] = UTL_STR_CONCATENATE("R1", FGPT_X[which]);
    for (i=0; ;i++)
    {
        if (-1 == UTL_SCAN_GETS( file, "\\", &line)) break;
        if (new)
        { UTL_SCAN_TOKENIZE(line,'<','\0');
          fprintf(elif,"%s\n",line); }
    }
    fclose(file); if (new) fclose(elif);
    return i;
}
char *strip_down(string)
char *string;
{
    int i;
    char foo, *retme;

    string = strstr(string,"=") ;
    for(; *string == '=' || *string == '"'; string++) foo = *string;
    if (foo != '"')
    {for (i=0; ;i++)
        if ( (string[i] == ';') || (string[i] == '>')) break; }
    else
    {for (i=0; ; i++)
        if ( (string[i] == '"')) break; }
    foo = string[i];
    string[i] = '\0';
    retme = UTL_STR_SAVE(string);
    string[i] = foo;
    return retme;
}

/* Assume that the fp file is opened and written to earlier */
int DoPiecewiseFingerprints()
{
    char *hold, *line1, *line2;
    int i;

    if (!(Y_01 = (int **) UTL_MEM_ALLOC( nY_01 * sizeof(int *))))
        return 0;
    if (!(Y_02 = (int **) UTL_MEM_ALLOC( nY_02 * sizeof(int *))))
        return 0;

    MakeAllPrints( xcoreCsln , 1, 1, &fingerPrint);
    DB_CT_CCT_GET_PRD_CLEANUP( xcoreCsln );

    for(i=0;i<nY_01;i++)
    {
        if (!(Y_01[i] = (int *) UTL_MEM_ALLOC(WordsPerFingerprint *
                                              sizeof(int))))
            return 0;
    }
    MakeAllPrints( temp1Csln , nY_01, 1, Y_01);
    DB_CT_CCT_GET_PRD_CLEANUP( temp1Csln );

    for(i=0;i < nY_02;i++)
    {
        if (!(Y_02[i] = (int *) UTL_MEM_ALLOC(WordsPerFingerprint *
                                              sizeof(int))))
            return 0;
    }
    MakeAllPrints( temp2Csln , 1, nY_02, Y_02);
    DB_CT_CCT_GET_PRD_CLEANUP( temp2Csln );
    return 1;
}

int WritefpFunc(struct CtConnectionTable *ct, int num, int **indexes)
{
    int nbits;
    int *fprint;

    fprint = *fingerPointer++;
    memset ( fprint, 0, BytesPerFingerPrint );
    if( !DB_BIT2_EVALUATE( ct, ScreenStructure, fprint, &nbits ))
        return 0 ;
    UTL_FILE_FWRITE( &nbits ,sizeof(int), 1 ,fpFile);
    UTL_FILE_FWRITE(fprint,sizeof(int),WordsPerFingerprint,fpFile);
    return 1;
}

int GrabfpFunc(struct CtConnectionTable *ct, int num, int **indexes)
{
    int *fprint;

    fprint = *fingerPointer++;
    memset ( fprint, 0, BytesPerFingerPrint );
    if( !DB_BIT2_EVALUATE( ct, ScreenStructure, fprint, &fingerBits ))
        return 0 ;
    return 1;
}

int MakeOnePrint( void *Csln , int i, int j, int *fp)
{
    static int **productIndexes = 0;
    if (!productIndexes)
    { productIndexes = (int **)UTL_MEM_CALLOC(2,sizeof(int *));
      productIndexes[0] = (int *)UTL_MEM_CALLOC(1,sizeof(int));
      productIndexes[1] = (int *)UTL_MEM_CALLOC(1,sizeof(int));
    }
    productIndexes[0][0] = i+1;
    productIndexes[1][0] = j+1;
    fingerPointer = &fp;
    DB_CT_CCT_GET_PRD_PRODUCT(Csln, 1, productIndexes, GrabfpFunc);
    return 1;
}

int MakeAllPrints(void *CslnThing, int n1, int n2, int **pfp)
{
    int numProducts, **productIndexes, i, j, nProcessed;
    int numConnections = 2;

    numProducts = n1 * n2;
    nProcessed = 0;
    productIndexes = (int **)UTL_MEM_CALLOC(numConnections,sizeof(int *));
    for ( i = 0 ; i < numConnections ; i++ )
        productIndexes[i] = (int *)UTL_MEM_CALLOC(numProducts,sizeof(int));

    for (i=0;i < n1;i++) for (j=0;j < n2;j++)
    {   productIndexes[0][nProcessed] = i + 1;
        productIndexes[1][nProcessed] = j + 1;
        nProcessed++;
    }
    fingerPointer = pfp;
    DB_CT_CCT_GET_PRD_PRODUCT(CslnThing, numProducts, productIndexes,
                              WritefpFunc);
    for ( i = 0 ; i < numConnections ; i++ ) UTL_MEM_FREE(productIndexes[i]);
    UTL_MEM_FREE(productIndexes);
    return 1;
}

/*** Also find Mbits and Lbits
     and write them where they belong */
/*
   Should reorganize to find worst cases rather than pure random
*/
int CheckMissingBits()
{
    int argCount, err, i, j;
    int counts[21];

    nProcessed = 0;
    for (i=0;i<21;i++) counts[i]=0;
    if (TopNumber) Fraction = (double) TopNumber / (double) (nY_01 * nY_02);
    for (i=0;i<nY_01;i++) for (j=0;j<nY_02;j++)
    {
        if (UTL_MATH_URAND() > Fraction) continue;
        nProcessed++;
        MakeOnePrint( fullCsln , i, j, fingerPrint);
        CompareFingerPrint(Y_01[i],Y_02[j],20,counts);
    }
    WriteMissingBits(20,counts) ;
    WriteMasterRecord();
    return 1;
}
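
The handling of the `-fraction` option in the routine above (a probability when at most 1.0, otherwise a desired product count converted to a probability) can be sketched on its own. The function names are hypothetical, `rand()` stands in for the listing's `UTL_MATH_URAND`, and the clamp to 1.0 is an addition of this sketch that the listing does not perform.

```c
#include <assert.h>
#include <stdlib.h>

/* Turn the -fraction argument into a sampling probability: values up to 1.0
 * are used directly; larger values are read as a desired product count. */
double sampling_fraction(double arg, int n1, int n2)
{
    double total;
    assert(arg >= 0.0 && n1 > 0 && n2 > 0);
    total = (double)n1 * (double)n2;
    if (arg <= 1.0)
        return arg;
    return arg >= total ? 1.0 : arg / total;   /* clamp: assumption */
}

/* Visit every (i, j) product and keep it with probability `fraction`.
 * Returns how many products were kept. */
long sample_products(int n1, int n2, double fraction)
{
    long kept = 0;
    int i, j;
    assert(fraction >= 0.0 && fraction <= 1.0);
    for (i = 0; i < n1; i++)
        for (j = 0; j < n2; j++) {
            double u = rand() / ((double)RAND_MAX + 1.0);  /* u in [0, 1) */
            if (u > fraction)
                continue;
            kept++;
        }
    return kept;
}
```

Because each product is kept independently, the number actually evaluated only approximates the requested count, which matches the wording of the option description.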

CompareFingerPrint( one, two, Nbins, bins)
int *one, *two, Nbins, *bins;
{
    unsigned char *h1, *h2, *h3, *fing;
    int i, product, card, lcard, lbits;

    h1 = (unsigned char *) one;
    h2 = (unsigned char *) two;
    h3 = (unsigned char *) fingerMask;
    fing = (unsigned char *) fingerPrint;
    lcard = card = lbits = 0;
    for(i=0;i<BytesPerFingerPrint;i++, h1++,h2++,h3++,fing++)
    { card  += nbits[ *h1 | *h2 ];
      lbits += nbits[ *h3 & *fing ];
      lcard += nbits[(*h1 | *h2 ) & *h3]; }
    if ((card = fingerBits - card) < 0) goto NoWay; /* should be impossible */
    if ((lcard = lbits - lcard) < 0) goto NoWay; /* should be impossible */
    if ( card > Mbits) Mbits = card;
    if (lcard > Lbits) Lbits = lcard;
    if (card > Nbins) card = Nbins;
    bins[card] += 1;
    return 1;
NoWay:
    return 0;
}

WriteMissingBits(n,counts)
int n, *counts;
{
    int i, sum;

    sum = 0;
    for(i=0;i<=n;i++) {printf("%d : %d; ",i,counts[i]); sum += counts[i]; }
    printf("\n");
    if (sum != nProcessed)
        fprintf(stderr,
            "Mismatch indicates possible error in core entry.\nOnly %d of %d found.\n",
            sum, nProcessed);
}

/* File format of the "master record" is
     Reaction class name
     Reaction specific name
     Number of varying sites == 2 so far
     Mbits
     Lbits
     *.core filename
     *.core index
     prefix.fp
     number of fp records before 1st == 0 in this program
     X1 filename
     X2 filename
*/

WriteMasterRecord()
{
    FILE *fp;
    char *hold;

    if (!(hold = UTL_STR_CONCATENATE(PrefixForFiles,".mr"))) return 0;
    if (!(fp = UTL_FILE_FOPEN(hold,"w"))) return 0;
    UTL_MEM_FREE(hold);
    if (!(hold = UTL_STR_CONCATENATE(PrefixForFiles,".fp"))) return 0;
    if (MoreRxnInfo)
        fprintf(fp, "Reaction class %s%s\n%s\n%d\n%d\n%d\n%s\n%d\n%s\n%d\n%s\n%s\n",
            ReactionCode, NullCore ? " NO_core" : "", UserRxnName, 2, Mbits,
            Lbits, Corefile, StartCore, hold, 0, X1file, X2file);
    else fprintf(fp,"Reaction class Unknown\n%s\n%d\n%d\n%d\n%s\n%d\n%s\n%d\n%s\n%s\n",
            PrefixForFiles, 2, Mbits, Lbits, Corefile, StartCore, hold, 0, X1file,
            X2file);
    UTL_MEM_FREE(hold);
    UTL_FILE_FCLOSE(fp);
}

int sln_defines_csln(void **c, char *file1, char *file2)
{
    int numConnections = 0;
    char *connectionFiles[2];

    if (file1) { connectionFiles[ numConnections++ ] = file1; }
    if (file2) { connectionFiles[ numConnections++ ] = file2; }
    if (numConnections < 2) { fprintf(stderr,"\nNo X1 or X2 file - failure.\n");
                              return 0; }
    *c = (void *) DB_CT_CCT_GET_PRD_INIT(CoreSln, Xrlist, numConnections,
                                          connectionFiles);
    if(!*c)
    { fprintf(stderr,"\nUnable to init"); return 0 ; }
    return 1;
}



Appendix "H"



/* Similarity - formerly dbcslnsim */
/* mod to read from the master file format (DEP 6/26/96) */
/* mod to read/write bitset files (DEP 9/19/96) */
/* mod to read $TA_MOLTABLES screendef file if not where fp file points */
/* mod to take the "-q" format of input SLN */
/* mod to use mask to improve searches (DEP 10/24/96) */

/*************************************************************************/
/*+C
 * This program evaluates (approximate) Tanimoto 2D similarity vs one cSLN
 * based on preprocessing of the substituent reagents.
 *
 * Input file is a master file with one multiline record per cSLN.
 * Record format is
 *     Reaction class xxxx  (where "Reaction class" is a literal)
 *     reaction_name
 *     number_of_sv_sites
 *     missing_bits_count  (may be overridden by mask)
 *     hashed_only_missing_bits_count
 *     core_filename
 *     core_filename_index_of_core
 *     fingerprint_filename
 *     offset_into_fingerprint_file
 *     first_sv_file_X1
 *     second_sv_file_X2  (etc if more than two sv_sites)
 *
 * Queries are input as SLN repeatedly from stdin; ending on ^D or X
 * The optional ASCII output file contains one line per hit, of the form
 *     Y1 Y2 T Tmax
 *     where Y1   = index of the substituent in X1.pro file
 *           Y2   = index of the substituent in X2.pro file
 *           T    = apparent Tanimoto similarity
 *           Tmax = maximum possible Tanimoto, given the slop bits (see below)
 *
 * The (required) checkpoint file is in the standard CSR format, which can
 * also be used instead of the master file to start a search.
 *
 * Similarity -master <name> -bitset <name> -Tanimoto <real> -range <exp>
 *            -index <int> -maxhits <int> -output <name> -checkpoint <name>
 *            +debug
 *
 * Options:
 *
 *   -master name      - name is the file with master file records
 *   -bitset name      - name is a result of an earlier search operation
 *                       (use EITHER master or bitset)
 *   -index number     - which sequential record in master file to use
 *                       OR offset into bitset in a bitset file
 *   -Tanimoto tan     - tan is a Tanimoto similarity 0.0 - 1.0
 *                       (default is 0.85)
 *   -maxhits max      - stop when max hits are found (default infinity)
 *   -input filename   - name of file with queries (default stdin)
 *   -q string         - single SLN query string
 *   -output filename  - specifies the output file for the hit info
 *                       (Mainly used for debugging - otherwise obsolete)
 *   -checkpoint name  - file to which bitset results will be written
 *   -mask hex         - hex format bitmask of missing bits (tan_hex form)
 *   -range range_exp  - set of internal cSLN ids (Y_01 varies slowest)
 *                       for which similarity will be computed. Range_exp
 *                       is a comma separated list of one or more of the
 *                       following primitives:
 *
 *                         *     - everything in the cSLN
 *                         1-18  - ids 1,2,3,...18
 *                         5-*   - ids from 5 to the last in the cSLN.
 *                         17    - id 17 only
 *
 *   -append           - append results to an existing output file
 *                       By default an output file is overwritten.
 *   +debug            - writes irrelevant info to stderr
 *                       This flag forces the display of all options
 */
/* use 3db
 * dbcc Similarity.c -o Similarity */
#include <stdio.h>
#include <signal.h>
#include <ctype.h>
#include <unistd.h>
#include <string.h>
#include <sys/stat.h>
#include <math.h>
#include "parseopt.h"
#include "utl_str.h"
#include "utl_mem.h"
#include "utl_file.h"
#include "ct.h"
#include "ct_expr.h"
#include "ct_proto.h"
#include "import_proto.h"
10 Jinclude "commonDa^h*' 

static char   *OutputFileName = 0;
static char   *MasterFile = 0;
static int    MasterRecord;
static FILE   *MasterFile_File;
static char   *FngrFile;
static int    FingerCore_Card;
static int    *FingerCore_FP;
static char   *InputSource = 0;
static char   *fullQuery;
static char   *BitsetFile;
static char   *CheckPointFileName;
static char   *directQuery = 0;
static double Tanimoto = 0.85;
static int    AppendToOutputFile = 0;
static int    WordsPerFingerprint = 0;
static int    BytesPerFingerPrint = 0;
static int    CurrentSlnId = 0;
static int    NoMorehitsPlease = 999999999;
static char   *DatabaseRangeString =
static int    DebugLevel;
static int    UserAborted;
static int    First, Last;
static int    Pro_size ;
static char   *mASCII = 0;
static int    *MaskMissingBits = 0;
static int    *MaskQueryBits = 0;

static struct ParseOptions Options[] = {
/***
*** DO NOT MOVE ENTRIES IN THIS TABLE. ADD ENTRIES ONLY AT THE END.
***/
    {"master", ParseOptString, &MasterFile,
        "Name is the file with master file records" },
    {"bitset", ParseOptString, &BitsetFile,
        "Name is the file with bitset records" },
    {"Tanimoto", ParseOptDouble, &Tanimoto,
        "Similarity threshold (0.0 to 1.0)" },
    {"index", ParseOptInt, &MasterRecord,
        "Which MasterRecord entry 1-n" },
    {"maxhits", ParseOptInt, &NoMorehitsPlease,
        "Maximum number of hits before stopping" },
    {"input", ParseOptString, &InputSource,
        "File from which queries will be read (default stdin). "},
    {"q", ParseOptString, &directQuery,
        "Query string to use instead of a file or stdin"},
    {"output", ParseOptString, &OutputFileName,
        "File to which ASCII hit info will be written. OBSOLETE "},
    {"checkpoint", ParseOptString, &CheckPointFileName,
        "File to which bitset info will be written."},
    {"mask", ParseOptString, &mASCII,
        "Hex mask of missing bits" },
    {"range", ParseOptString, &DatabaseRangeString,
        "Range of cSLN ids to compare to query" },
    {"append", ParseOptNoArg, &AppendToOutputFile,
        "Use -append to append results to an existing file" },
    {"debug", ParseOptBoolean, &DebugLevel,
        "Use +debug to enable debugging messages" },
};

int UBS_OUTPUT_MESSAGE() { return 0; } /* just for compiling OK */
int UIMS2_WRITE_PHOTO() { return 0; }
int lowercase (s) char *s; {while (*s) { if (isupper(*s)) *s = tolower(*s); s++;}}
static void UserHitControlC()
/*+I
 * This function is the signal handler for user initiated program termination.
 * Its only role is to set a flag indicating that the user wishes to abort the program.
 *
 * Author        Date      Description
 * G. B. Smith   02-09-93  Original Version
 *
 */
{
    UserAborted = 1;
}



static int ParseArguments( argc, argv )
/*+I
 * This function parses the command line arguments.
 *
 * Returns: 1 on a successful command line parse, 0 otherwise.
 *
 * Warnings:
 *
 * Errors:
 *
 * See Also:
 *
 * Author        Date      Description
 * G. B. Smith   02-09-93  Original Version
 *
 */
int argc;
char **argv;
{
    int nargs,
        noptions = sizeof( Options )/sizeof(Options[0]);

    nargs = UTL_PARSE_OPT( argc, argv, noptions, Options );
    if( !nargs ) goto SyntaxError;
    return 1;
SyntaxError:
    return 0;
}

static int OpenOutputFile()
/*+I
 *
 * Returns: 1 on success, else 0
 *
 */
{
    char *msg;
    FILE *fp;

    if( OutputFileName )
/*
** We need to create output files under the ownership of the REAL user not the
** EFFECTIVE user. This only applies if setuid options are activated.
*/
    {
        struct stat statBuff ;
        int uid ;
        int euid ;

        uid = getuid() ;
        euid = geteuid();
        stat(OutputFileName, &statBuff);
/*
** There are two cases
** (1) the file to output to exists
**     Use the ownership of the current owner of the file or if you cant do that
**     do not do anything.
** (2) The file is being created.
**     use the ownership of the REAL user.
*/
        if ( access(OutputFileName, F_OK) == 0 )
        { /* If the file exist and the real user is the owner of the file */
            if ( statBuff.st_uid == uid )
                seteuid(uid);
        }
        else
        { /* Create the file as the REAL user */
            seteuid(uid);
        }
        OutputFile = fopen( OutputFileName, (AppendToOutputFile?"a":"wb"));
        if( !OutputFile ) {
            fprintf(stderr, "Error: Failed to open output file \"%s\"\n",
                    OutputFileName );
            goto ErrorReturn;
        }
    }
    return 1;
ErrorReturn:
    return 0;
}

static int ParseRangeExpr( expr, maximum, low, high )
/*+I
 *
 * Function evaluates a structure range expression. See the module
 * description in this file for a definition of structure range expressions.
 *
 * Returns: Function returns 1 if the expression is correct. If the
 *          expression is incorrect 0 is returned.
 *
 * Author        Date      Description
 * G. B. Smith   02-12-91  Original Version
 *
 */
char *expr;   /* A structure range expression */
int maximum;  /* Maximum structure number, 999999999 */
int *low;     /* RETURN: low value in the range */
int *high;    /* RETURN: High value in the range */
{
    char *p;

    for( p=expr; *p && isdigit(*p); p++ );
    if( !*p ) {
        sscanf( expr, "%d", low );
        *high = *low;
    } else if( 2 == sscanf( expr, "%d-%d", low, high)){
        ;
    } else if( 1 == sscanf(expr,"%d-*",low )) {
        *high = maximum;
    } else if( !strcmp( expr, "*")) {
        *low = 1; *high = maximum;
    } else {
        fprintf(stderr, "ERROR: Invalid structure range \"%s\"\n",
                expr );
        goto BadExpression;
    }
    if( *low < 1 ) {
        fprintf(stderr,
            "ERROR: Structure range must be greater than zero\n" );
        goto BadExpression;
    }
    if( *high > maximum ) {
        fprintf(stderr,
            "INFO: Specified range (%d-%d) is greater than the total number of structures\n", *low, *high );
        *high = maximum;
    }
    if( *high < *low ) {
        fprintf(stderr, "ERROR: Low range value (%d) is larger than high value (%d)\n",
                *low, *high );
        goto BadExpression;
    }
    return 1;
BadExpression:
    return 0;
}
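
The four range primitives accepted above (`*`, `N`, `N-M`, `N-*`) can also be parsed with stricter end-of-string checks. The following is a minimal sketch in ANSI-style C, not the listing's routine: the name `parse_range` is hypothetical, it handles a single primitive rather than a comma separated list, and it rejects trailing characters that the listing would tolerate.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Parse one range primitive: "*", "N", "N-M" or "N-*".
 * On success returns 1 and stores a 1-based inclusive range in low/high,
 * with high clamped to maximum. Returns 0 for malformed or empty ranges. */
int parse_range(const char *expr, int maximum, int *low, int *high)
{
    int n;
    assert(expr != NULL && low != NULL && high != NULL && maximum >= 1);

    if (strcmp(expr, "*") == 0) {
        *low = 1;
        *high = maximum;
    } else if (n = 0, sscanf(expr, "%d-%d%n", low, high, &n) == 2
               && n > 0 && expr[n] == '\0') {
        /* "N-M": both ends given explicitly */
    } else if (n = 0, sscanf(expr, "%d-*%n", low, &n) == 1
               && n > 0 && expr[n] == '\0') {
        *high = maximum;            /* "N-*": open-ended upper bound */
    } else if (n = 0, sscanf(expr, "%d%n", low, &n) == 1
               && n > 0 && expr[n] == '\0') {
        *high = *low;               /* "N": a single id */
    } else {
        return 0;
    }
    if (*low < 1)
        return 0;
    if (*high > maximum)
        *high = maximum;
    return *high >= *low;
}
```

The `%n` conversion records how many characters were consumed, so an input such as `5-` is rejected instead of being read as `5-*`.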

int main( argc, argv ) 
/*+E 
* 
*/ 

int argc; 
char **argv; 
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{ 

char comline[2048]; 
long startTime, 
totalTime, 
finishTime; 

/* Establish handler for a user interrupt. */ 
signal( SIGINT, UserHitControlC ); 
#ifdef SIGHUP 
signal( SIGHUP, UserHitControlC ); 
#endif 

if( !ParseArguments( argc, argv ) ) 
goto SyntaxError; 

if (!ParseRangeExpr(DatabaseRangeString, 999999999, &First, &Last)) 
goto SyntaxError; 
First--; Last--; 

if (!OpenOutputFile()) goto FailureExit; 
time( &startTime ); 
Visual((stderr,"Begin reading files: %s",ctime(&startTime))); 

/* Let's actually do something now */ 
if (!ReadEverything()) goto FailureExit; 
time( &finishTime ); 
Visual((stderr,"Begin comparison: %s",ctime(&finishTime))); 
if (!UserAborted && !CompareEverything()) goto FailureExit; 

if (OutputFile) fclose(OutputFile); 
time( &finishTime ) ; 
totalTime = finishTime - startTime; 
if( !totalTime ) totalTime = 1; 
Visual((stderr, "Created %d Finger Prints in ", nProcessed )); 
Visual((stderr,"%d Hours, %d min, %d secs\n", 
totalTime/(60*60), 
(totalTime%(60*60))/60, 
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(totalTime%60))); 

Visual((stderr,"Each comparison required %.8f seconds to calculate\n", 
(totalTime/((double)(nProcessed?nProcessed: 1))))); 

MakeComLine(comline, 2048, argc, argv); 
CheckPointProgram(comline); 

Visual((stderr,"End Finger Print Computation: %s",ctime(&finishTime))); 

UserAborted ? exit(ErrorExit) : exit(GoodExit); 
SyntaxError: 
exit(1); 
FailureExit: 
exit(ErrorExit); 

} 

int ReadEverything() 
{ 
char *hold; 
char buff[256]; 
int i; 
int offset, size; 
void *bitset=0; 

/* because failure here means end program run, no effort to clean up 
memory on error is included. */ 
if (!MasterFile && !BitsetFile ) return 0; 
setbits_nbits_Init(); 

TotalInputs = 1; /* no provision for concatenated */ 
InputNames[0] = MasterFile ? MasterFile : BitsetFile; 
InputStartRec[0] = MasterRecord; 
if (MasterFile && !MasterRecord) InputStartRec[0] = 1 ; 

if (CheckPointFileName) 
OutputCheckpointNames[0] = CheckPointFileName; 
else 
{ sprintf(buff,"%s_%d_chk.bs",InputNames[0],0); 
OutputCheckpointNames[0] = UTL_STR_SAVE(buff); 
} 
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nY_01 = nY_02 = 0; 
if (MasterFile) 
{ if ( !RetrieveMasterFile(InputNames[0], 
MasterFile_File , 
InputStartRec[0], 
&(NumMissingBits[0]), 
&(BitsInAbsentiaNoCount[0]), 
&(CoreFileNames[0]), 
&(CoreStart[0]), 
&FngrFile, 
&(X1file[0]), 
&(X2file[0]), 
&(Y_01_Length[0]), 
&(Y_02_Length[0]), 
&fingerFP[0], 
&fingerOffsets[0], 
&ScreenFileName, 
&BytesPerFingerPrint, 
&WordsPerFingerprint, 
&query, 
&FingerCore_FP, 
&FingerCore_Card ) ) 
goto UnableToReadMaster ; } 
else 
{ 
if ( !( bitset = CS_PRDCT_BITSET_OPEN(InputNames[0], 
InputStartRec[0])) ) 
goto UnableToReadBitset ; 
if ( !RetrieveMasterFileFromBitset(bitset, 
&(MasterFile_Bitset[0]), 
&(StartRec_Bitset[0]), 
&(NumMissingBits[0]), 
&(BitsInAbsentiaNoCount[0]), 
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&(CoreFileNames[0]), 
&(CoreStart[0]), 
&FngrFile, 
&(X1file[0]), 
&(X2file[0]), 
&(Y_01_Length[0]), 
&(Y_02_Length[0]), 
&fingerFP[0], 
&fingerOffsets[0], 
&ScreenFileName, 
&BytesPerFingerPrint, 
&WordsPerFingerprint, 
&query, 
&FingerCore_FP, 
&FingerCore_Card ) ) 
goto UnableToReadBitset ; 
} 

nY_01 += Y_01_Length[0] ; 
nY_02 += Y_02_Length[0] ; 

if (!WarmUp()) goto UnableToWarmUp; 

RemainingInput[0] = SomeLeft = Y_01_Length[0] * Y_02_Length[0] ; 
Pro_size = ( 31 + SomeLeft )/32 * 4; 
BitMapStartPoint[0] = 0; 
if (!Good_Products) /* initialize iff not already done */ 
{if (!(Good_Products = (int *) UTL_MEM_ALLOC(Pro_size))) return 0; 
memset( Good_Products,0,Pro_size); } 
if (!Dead_Products) /* initialize iff not already done */ 
{if (!(Dead_Products = (int *) UTL_MEM_ALLOC(Pro_size))) return 0; 
memset( Dead_Products,0,Pro_size); 
if (bitset) /* assumes actual sizes match current sizes */ 
{ CS_PRDCT_BITSET_TO_RAW( bitset, Dead_Products, 0); 
not_here(Dead_Products,Pro_size ); 
} 
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} 

if (!(Y_01 = (int **) UTL_MEM_ALLOC(sizeof(int *) * nY_01))) return 0; 
if (!(cY_01 = (int *) UTL_MEM_ALLOC(sizeof(int ) * nY_01))) return 0; 
if (!(iY_01 = (int *) UTL_MEM_ALLOC(sizeof(int ) * nY_01))) return 0; 
for(i=0;i<nY_01;i++) 
{ 
if (! GetNextLine( cY_01 +i, Y_01 +i )) return 0; 
} 

if (!(Y_02 = (int **) UTL_MEM_ALLOC(sizeof(int *) * nY_02))) return 0; 
if (!(cY_02 = (int *) UTL_MEM_ALLOC(sizeof(int ) * nY_02))) return 0; 
if (!(iY_02 = (int *) UTL_MEM_ALLOC(sizeof(int ) * nY_02))) return 0; 
for(i=0;i<nY_02;i++) 
{ 
if(! GetNextLine( cY_02+i,Y_02+i )) return 0; 
} 

return 1; 

UnableToWarmUp: 
fprintf(stderr,"Unable to Read screen file\n"); 
return 0; 
UnableToReadMaster: 
fprintf(stderr, "Unable to Read master file\n"); 
return 0; 
UnableToReadBitset: 
fprintf(stderr, "Unable to Read bitset file\n"); 
return 0; 

} 

int WarmUp() 
{ 
char *where_else,*name, *ext; 
int words; 
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if (!(fp = fopen(ScreenFileName,"r"))) 
{ 
where_else = UTL_FILE_PARSE(ScreenFileName,4); 
name = UTL_STR_CONCATENATE("sybylbase/tables/",where_else); 
UTL_MEM_FREE(where_else); 
ext = UTL_FILE_PARSE(ScreenFileName,5); 
where_else = UTL_FILE_COMPOSE_SPEC( "TA_ROOT", name, ext); 
if (!(fp = fopen(where_else,"r"))) return 0; 
UTL_MEM_FREE(where_else); 
UTL_MEM_FREE(name); 
UTL_MEM_FREE(ext); 
} 

ScreenStructure = (int *) DB_BIT2_PARSE_2DSCREEN(fp); 
fclose(fp); fp = 0; 
if (!ScreenStructure) return 0; 
CurrentInput = 0; 

if (mASCII) /* generate binary missing bits */ 
{ 
if ( (strlen(mASCII) / 8) != WordsPerFingerprint) return 0; 
if (!(MaskMissingBits = (int *) UTL_MEM_ALLOC( BytesPerFingerPrint))) 
return 0; 
if (!(MaskQueryBits = (int *) UTL_MEM_ALLOC( BytesPerFingerPrint))) 
return 0; 
for (words=0;words<WordsPerFingerprint;words++) 
{ 
memcpy(next8,mASCII,8); 
mASCII += 8; 
sscanf(next8,"%8x", MaskMissingBits + words); 
} 
} 

return 1; 

} 

int MakeAFingerprint( sln, fingerPrint) 
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char *sln; 
int *fingerPrint; 
{ 
struct CtConnectionTable *ct; 
int nBitsSet; 

if (!(ct = DB_IMPORT_SLN(sln))) return 0; 
memset ( fingerPrint, 0, BytesPerFingerPrint ); 
if( !DB_BIT2_EVALUATE( ct, ScreenStructure, fingerPrint, &nBitsSet )) 
return 0 ; 
return nBitsSet; 
} 

int GetNextLine( pCard, pFP) 
int *pCard, **pFP; 
{ 
if (!(*pFP = (int *) UTL_MEM_ALLOC( BytesPerFingerPrint))) return 0; 
if (!UTL_FILE_FREAD( pCard,sizeof(int), 1 ,fingerFP[0])) return 0; 
if (!UTL_FILE_FREAD( *pFP ,sizeof(int), WordsPerFingerprint ,fingerFP[0])) 
return 0; 
return 1; 
} 

int IntersectQuery( pIntr, pFP) 
int *pIntr, **pFP; 
{ 
unsigned char *ptr ,*qtr; 
int i, count; 

ptr = (unsigned char *) *pFP; 
qtr = (unsigned char *) query; 
for(count=0, i =0; i < WordsPerFingerprint*4;i++) 
count += nbits[ *ptr++ & *qtr++]; 
*pIntr = count; 
return 1; 

} 
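
The byte-wise counting done by `IntersectQuery` relies on a 256-entry table giving the number of set bits in each byte value. A minimal sketch of that idea follows; the names `nbits_tab`, `init_nbits` and `intersect_count` are hypothetical stand-ins, and the table initialisation shown is one common way to build such a table, not necessarily what `setbits_nbits_Init` does.

```c
#include <stddef.h>

static int nbits_tab[256];   /* hypothetical stand-in for the listing's nbits[] */

/* Fill the table so that nbits_tab[b] is the number of 1 bits in byte b. */
static void init_nbits(void)
{
    int b;
    nbits_tab[0] = 0;
    for (b = 1; b < 256; b++)
        nbits_tab[b] = (b & 1) + nbits_tab[b >> 1];
}

/* Count the bits common to two fingerprints of nbytes bytes each. */
static int intersect_count(const unsigned char *fp, const unsigned char *q,
                           size_t nbytes)
{
    int count = 0;
    size_t i;
    for (i = 0; i < nbytes; i++)
        count += nbits_tab[fp[i] & q[i]];
    return count;
}
```

Working a byte at a time keeps the table small (256 entries) while still avoiding a bit-by-bit loop over the fingerprint.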

int CompareEverything() 
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{ 
int cqt, q_lo, q_hi, i, j, carhold, inthold, onion, intsc, countinput; 
double max; 

countinput = 0; 
if ( ! directQuery ) 
{if (!InputSource) InputSourceFile = stdin; 
else if (! (InputSourceFile = fopen(InputSource,"r"))) return 0; 
} 

while ( directQuery ? 
((fullQuery = directQuery) && countinput == 0) : 

M != UTL_SCAN_GETS( InputSourceFile, "W", , AMlQuery))) 

{ 
countinput++; 
if (! (c_query = MakeAFingerprint(fullQuery ,query) )) return 0; 
if (MaskMissingBits) ReNumMissingBits(1); 

for(i=0;i<nY_01;i++) 
if (! IntersectQuery( iY_01 +i,Y_01 +i )) return 0; 
for (i=0;i<nY_02;i++) 
if (! IntersectQuery( iY_02+i,Y_02+i )) return 0; 
CurrentSlnId = 0; 

cqt = floor( (double) c_query / Tanimoto); 
q_lo = floor( (double) c_query * Tanimoto - (double) NumMissingBits[0]); 
q_hi = ceil( (double) ( c_query + NumMissingBits[0]) / Tanimoto); 

/* should convert test of Dead_Products to a "UTL_SET_NEXT" approach ?? */ 
for(i=0;i<nY_01;i++) 
{ 
if (CurrentSlnId > Last) break; 
if (cY_01[i] > cqt) { CurrentSlnId += nY_02; continue;} 
carhold = q_lo - cY_01[i]; 
inthold = q_lo - iY_01[i]; 
for(j=0;j<nY_02;j++) 
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{ 
if (UserAborted) return 1; 
if (CurrentSlnId > Last) break; 
if (CurrentSlnId < First) { CurrentSlnId++; continue; } 
if (cY_02[j] > cqt) { CurrentSlnId++; continue; } 
if (cY_02[j] < carhold) { CurrentSlnId++; continue; } 
if (inthold > iY_02[j]) { CurrentSlnId++; continue; } 
if (TestDead(0,CurrentSlnId)) { CurrentSlnId++; continue; } 

ActuallyCompute( i, j, &onion, &intsc, &max); 
if (max >= Tanimoto) 
{ 
OutputThisHit(i,j,onion, intsc, max); 
nProcessed++; 
if (nProcessed >= NoMorehitsPlease) return 1; 
} 
CurrentSlnId++; 
} /* Y_02 loop */ 
} /* Y_01 loop */ 
} /* while still queries left */ 
return 1; 

} 
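
The cardinality tests in the loops above (`cqt`, `q_lo`) skip reagent pairs whose bit counts alone rule out a hit. The sketch below illustrates the underlying bound for two plain fingerprints; it deliberately ignores the slop (missing-bit) adjustment used in the listing, and the function names are hypothetical.

```c
/* Upper bound on Tanimoto similarity from bit counts alone:
   |A & B| <= min(|A|,|B|) and |A | B| >= max(|A|,|B|). */
static double tanimoto_upper_bound(int card_a, int card_b)
{
    int lo = card_a < card_b ? card_a : card_b;
    int hi = card_a < card_b ? card_b : card_a;
    return hi == 0 ? 0.0 : (double)lo / (double)hi;
}

/* Return 1 when a candidate with card_cand set bits cannot reach the
   threshold against a query with card_query set bits, so the full
   bitwise comparison can be skipped. */
static int can_skip(int card_query, int card_cand, double threshold)
{
    return tanimoto_upper_bound(card_query, card_cand) < threshold;
}
```

For a query with 40 set bits and a threshold of 0.85, any candidate with more than `floor(40 / 0.85) = 47` set bits can be rejected without comparing fingerprints, which corresponds to the `cqt` test in the listing.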

int ReNumMissingBits( int howmany ) 

{ 

for ( ; howmany ; howmany--) 

ReNum(MaskMissingBits,query,WordsPcrFingerprinty&(NumMissingBits[ho^ 

); 

} 

int ReNum(int *mask, int*query, int len, int *missed) 
{ 
unsigned char *one, *two; 
unsigned char *masq; 

masq = (unsigned char *) MaskQueryBits; 
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one = (unsigned char *) mask; 
two = (unsigned char *) query; 
*missed = 0; 
len *= 4; 
for ( ; len ; len--) *missed += nbits[ (*masq++ = *one++ & *two++) ]; 
return 1; 

} 

int ActuallyCompute( index1, index2, pUnion, pIntersection, pMaxTan) 
int index1, index2, *pUnion, *pIntersection; 
double *pMaxTan; 
{ 
int i, product; 
unsigned char *h1, *h2, *hquery, *masq; 
int nuMissing; 

if (DebugLevel) 
fprintf( stderr," ActuallyCompute at %d , %d\n", index1, index2); 
h1 = (unsigned char *) Y_01[index1]; 
h2 = (unsigned char *) Y_02[index2]; 
hquery = (unsigned char *) query; 
*pUnion = *pIntersection = 0; 
if (mASCII) {nuMissing = 0; masq = (unsigned char *) MaskQueryBits; } 
else {nuMissing = NumMissingBits[0];} 

for(i=0; i<WordsPerFingerprint*4;i++) 
{ 
product = *h1++ | *h2++ ; 
*pUnion += nbits[ product | *hquery]; 
if (mASCII) 
nuMissing += nbits[ ~product & *masq++]; 
*pIntersection += nbits[ product & *hquery++]; 
} 

if (DebugLevel > 9) fprintf(stderr,"%d / %d %6.3f\n", 
*pIntersection, *pUnion, 
(double) *pIntersection / *pUnion); 
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return (*pMaxTan = (double) (*pIntersection + nuMissing) / (double) *pUnion); 
} 
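
The computation in `ActuallyCompute` can be summarised as: OR the two reagent fingerprints to approximate the product fingerprint, count union and intersection bits against the query, then add the slop bits to the intersection to obtain an optimistic maximum. The following is a self-contained sketch of that arithmetic, assuming a fixed slop count rather than the mask-based variant; all names are hypothetical.

```c
#include <stddef.h>

/* Portable per-byte bit count (hypothetical helper, replaces the nbits[] table). */
static int popcount8(unsigned int b)
{
    int n = 0;
    for (b &= 0xFFu; b; b >>= 1)
        n += (int)(b & 1u);
    return n;
}

/* Tanimoto similarity between a query and the OR of two reagent fingerprints.
   "slop" is added to the intersection to give an optimistic (maximum) value.
   The apparent (unadjusted) similarity is written to *apparent. */
static double product_tanimoto(const unsigned char *r1, const unsigned char *r2,
                               const unsigned char *query, size_t nbytes,
                               int slop, double *apparent)
{
    int uni = 0, inter = 0;
    size_t i;
    for (i = 0; i < nbytes; i++) {
        unsigned int product = (unsigned int)(r1[i] | r2[i]);
        uni   += popcount8(product | query[i]);
        inter += popcount8(product & query[i]);
    }
    if (uni == 0) {            /* all fingerprints empty: define similarity as 0 */
        *apparent = 0.0;
        return 0.0;
    }
    *apparent = (double)inter / (double)uni;
    return (double)(inter + slop) / (double)uni;
}
```

The two values correspond to the `T` and `Tmax` columns written for each hit.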

int OutputThisHit( index1, index2, onion, intsc, maxtan) 
int index1, index2, onion, intsc; 
double maxtan; 
{ 
if (OutputFile) 
fprintf(OutputFile,"%6d %6d %5.3f %5.3f\n", index1+1 ,index2+1 , 
(double) intsc / (double) onion, 
maxtan); 
/* just note in bitset as a hit */ 
FlagProduct(Good_Products, index1, index2, 0); 
return 1; 
} 

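static int not_here( what, nbytes ) 
unsigned char *what; 
int nbytes; 
{ 
/* complement each byte in place; the increment is kept out of the
   assignment so the read and write of *what are well ordered */ 
for ( ; nbytes; --nbytes, what++) *what = ~*what; 
return 1; 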

} 

/* this belongs in the utl module, actually */ 
int MakeComLine( char *line, int len, int argc, char **argv) 
{ 
int i; 

sprintf(line,"%s ",argv[0]); 
for(i = 1 ;i < argc;i++) 
{ 
line += strlen(line); 
sprintf(line,"%s ",argv[i]); 
} 

} 
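
`MakeComLine` receives the buffer length but, as printed, never checks it. A bounded variant might look like the sketch below. This is an illustration only: `make_com_line` is a hypothetical name, it relies on C99 `snprintf`, and unlike the listing it does not leave a trailing space after the last argument.

```c
#include <stdio.h>
#include <string.h>

/* Join argv[0..argc-1] with single spaces into line (capacity len bytes).
   Returns 1 if everything fit, 0 if the output had to be truncated. */
static int make_com_line(char *line, size_t len, int argc, char **argv)
{
    size_t used = 0;
    int i;

    if (len == 0)
        return 0;
    line[0] = '\0';
    for (i = 0; i < argc; i++) {
        int n = snprintf(line + used, len - used, "%s%s",
                         i ? " " : "", argv[i]);
        if (n < 0 || (size_t)n >= len - used)
            return 0;           /* truncated; line is still NUL-terminated */
        used += (size_t)n;
    }
    return 1;
}
```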

CheckPointProgram(programName) 
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char *programName ; 
{ 
int sizes[2] , size; 
int allocSizes[2] ; 
int numInSites[2] ; 
char hold[81] ; 
int i ; 
void *compressed ; 
int total ; 

for ( i = 0 ; i < TotalInputs ; i++ ) 
{ 
sizes[0] = Y_01_Length[i] ; 
sizes[1] = Y_02_Length[i] ; 
numInSites[0] = numInSites[1] = -1 ; 
allocSizes[0] = allocSizes[1] = -1 ; /* should keep bitset 
allocSizes if present?*/ 
compressed = NIL; 
total = 0; 
WriteOutCheckPointFile(OutputCheckpointNames^^^ 
MasterFile ? InputNames[i] 
: MasterFile_Bitset[i], 
MasterFile ? InputStartRec[i] 
: StartRec_Bitset[i], 
programName, 
Good_Products, 
BitMapStartPoint[i], 
2, 
sizes, 
allocSizes, 
Selections[i], 
numInSites, 
total, 
compressed); 
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} 

} 
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Appendix "1" 

V 

/* dbcslnquickselect 
*/ 

/*+C 
* This program evaluates (approximate) Tanimoto 2D similarity vs cSLNs 
* based on preprocessing of the substituent reagents. Using this, it 
* selects a diverse set of products while trying to maximize use of 
* some groups. 
* 
* ToDo: 
* Following ADS group suggestions, order the reagent fp by size (fpcard). 
* To be added: restart capability and reagent blackout. 
* 
* The input files, one per X1, X2, have one line per 
* structure and contain the elements "fpcard=xxx;" and "fp=zzz;" where 
* the terminating may also be The integer value of fpcard is 
* the cardinality of the fingerprint; the hex value of fp is the 
* fingerprint bitstring as two ascii bytes per bitset byte; 
* 
* Queries are input as SLN repeatedly from stdin; ending on ^D or X 
* The resultant file contains one line per hit, of the form 
* Y1 Y2 T Tmax 
* where Y1 = index of the substituent in X1.pro file 
* Y2 = index of the substituent in X2.pro file 
* T = apparent Tanimoto similarity 
* Tmax = maximum possible Tanimoto, given the slop bits (see below) 
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* dbcslnquickselect -prefix <name> -Tanimoto <real> -prefer <what> -append 
* -slop <int> -maxhits <int> -output <name> +debug 
* 
* Options: 
* 
* -prefix name - name is the prefix for a set of files 
* with extensions .X1.pro .X2.pro 
* ; files have fingerprints 
* (someday) will reload from prefix, RELOAD if present 
* 
* -Tanimoto tan - tan is a Tanimoto similarity 0.0 - 1.0 
* (default is 0.85) 
* 
* -prefer - one of R1,R2 else random. R1 maximizes use of R1 
* 
* -slop bitcount - bitcount is the number of bits in the 
* product fingerprint that may not be 
* represented by ORing X1 X2 (default 0) 
* 
* -maxhits max - stop when max hits are found (default infinity) 
* 
* -output filename - specifies the output file for the hit info 
* by default results are sent to stdout. 
* 
* -append - append results to an existing output file 
* By default an output file is overwritten. 
* 
* +debug - writes irrelevant info to stderr 
* 
* -rangevar - List of field names and ranges to filter 
* the final list with. 
* -oneof - List of field names and values that the product 
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should match in order to be considered. 
This flag forces the display of all 

*/ 

/*use3db 
* dbcc dbcslnquickselect.c -o dbcslnquickselect */ 

#include <stdio.h> 
#include <signal.h> 
#include <ctype.h> 
#include <unistd.h> 
#include <string.h> 
#include <sys/stat.h> 
#include <math.h> 
#include "parseopt.h" 
#include "utl_str.h" 
#include "utl_mem.h" 
#include "utl_file.h" 
#include "utl_math.h" 
#include "ct.h" 
#include "ct_expr.h" 
#include "ct_proto.h" 
#include "import_proto.h" 

#define GoodExit 0 
#define ErrorExit 1 

#define Visual(s) { fprintf s; } 

#define ALLOCATE_INCREMENT 5 
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#define MISSING_FLOAT_VALUE -100000000.00 
#define MISSING_INT_VALUE -1 
#define NOT_A_MATCH_VALUE -2 
#define SMALL_FLOAT 0.00001 

/* 
** Command line arguments -rangevar and -oneof are kept here. 
*/ 
static char *RangeVar ; 
static char *OneOfVar ; 

/* 
** Structure to hold the field name (inside the nnn.x? files) and the allowed 
** range for that field. 
*/ 
typedef struct RangeStruct 
{ 
char *RangeFieldName ; 
float lowValue ; 
float highValue ; 
} RangeStruct ; 

int NumRangeFields ; 
int NumRangeFieldsAllocated ; 
RangeStruct *RangeFields ; 

/* 
** Structure to hold the field name and a list of values for the selection 
** type fields. 
*/ 
typedef struct OneOfStruct 
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{ 
char *OneOfFieldName ; 
int numValues ; 
int numValuesAlloc ; 
char **values ; 
} OneOfStruct ; 

int NumOneOfFieldsAllocated ; 
int NumOneOfFields ; 
OneOfStruct *OneOfValues ; 

float **RangeValues_Y01 ; /* Actual values read in from nnn.X1 file. 
If MW is the first and logp is the second value 
specified on the -rangevar argument list then 
RangeValues_Y01[n][0] would keep the value for MW 
for the nth line in the nnn.X1 file and 
RangeValues_Y01[n][1] would keep the value for 
logp for that line */ 

float **RangeValues_Y02 ; /* same */ 

int **OneOfValues_Y01 ; /* Actual values read from nnn.X1 files but translated 
into an index of OneOfValues[i].values so 
we don't have to waste memory and time doing strcmp */ 

int **OneOfValues_Y02 ; /* Same */ 

static FILE *OutputFile; 
static char *OutputFileName; 



static char *WhatFirst; 
static int What1 = -1; 
static int What2; 
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static char *PrefixForFiles; 
static char *InputSource = 0; 

static FILE *InputSourceFile; 



/* Code presumes that an int is 32 bits. ASCII-ed into %.8x format */ 

static int **Y_01; /* fingerprints */ 
static int **Y_02; /* " */ 
static int *query; /* " */ 
static int nY_01; /* number of structures */ 
static int nY_02; /* " */ 
static int *cY_01; /* cardinality of fingerprints */ 
static int *cY_02; /* " */ 
static int c_query; /* " */ 
static int *iY_01; /* intersection count of fprints */ 
static int *iY_02; /* " */ 

static int *Good_1; 
static int *Good_2; 
static int *Dead_1; 
static int *Dead_2; 
static int *Good_Products; 
static int *Dead_Products; 

static int nbits[256]; 
static int setbits[8]; 

static double Tanimoto = 0.85; 
static int BitsInAbsentia = 0; 
static int AppendToOutputFile = 0; 
static int WordsPerFingerprint = 0; 
static int BytesPerFingerPrint = 0; 
static int NoMorehitsPlease = 999999999; 
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static int 
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DebugLevel = 0 ; 
static int UserAborted; 

static int nProcessed = 0; 
static int SomeLeft; 
static char next8[10] = "01234567\0"; 



static struct ParseOptions Options[] = { 
/*** 
DO NOT MOVE ENTRIES IN THIS TABLE. ADD ENTRIES ONLY AT THE 
END. 
***/ 
{"prefix", ParseOptString, &PrefixForFiles, 
"Prefix for all input files" }, 
{"Tanimoto", ParseOptDouble, &Tanimoto, 
"Similarity threshold (0.0 to 1.0)" }, 
{"slop", ParseOptInt, &BitsInAbsentia, 
"Number of potentially missing bits in product fp" }, 
{"maxhits", ParseOptInt, &NoMorehitsPlease, 
"Maximum number of hits before stopping" }, 
{"input", ParseOptString, &InputSource, 
"File from which queries will be read (default stdin)." }, 
{"output", ParseOptString, &OutputFileName, 
"File to which hit info will be written. "}, 
{"prefer", ParseOptString, &WhatFirst, 
"One of R1, R2 to maximize use of."}, 
{"append" , ParseOptNoArg, &AppendToOutputFile, 
"Use -append to append results to an existing file" }, 
{"debug", ParseOptBoolean, &DebugLevel, 
"Use +debug to enable debugging messages" }, 
{"rangevar", ParseOptString, &RangeVar, 
"Scalar field name and range to filter out, i.e. logp -1.0 8.0 MW 200 500 price 0 12.50" }, 
{"oneof", ParseOptString, &OneOfVar, 
"Field name and list of values that the product should match\n, i.e. supplier Aldrich,Sigma,Fluka,SALOR taste SWEET,Salty" }, 
}; 



int UBS_OUTPUT_MESSAGE() { return 0; } /* just for compiling OK */ 
int UIMS2_WRITE_PHOTO() { return 0; } 
int lowercase (s) char *s; {while (*s) { if (isupper(*s)) *s = tolower(*s); s++;}} 

static void UserHitControlC() 
/* 
* This function is the signal handler for user initiated program termination. 
* Its only role is to set a flag indicating that the user wishes to abort the program. 
*/ 
{ 
UserAborted = 1; 
} 
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mm 
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5 ** Abstract 

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: Function parses range field string for ADS design programs. 
** It takes a string of the form 
** "logp -1.0 8.0 MW 200 500 price 0 12.50" and fills in the 
** global array RangeFields. 
** 
** Usage : 
** 
** Returns : 1 on success, 0 for failure. 
** 
** Algorithms : None. 
** 
** Revision History : 
** 
**-E: 
*/ 

int ParseRangeVar(rangeVar,numRangeFieldsAllocated,numRangeFields,rangeFields) 
char *rangeVar ; 
int *numRangeFieldsAllocated ; 
int *numRangeFields ; 
struct RangeStruct **rangeFields; 
{ 
static int stat = 0 ; 
char *buffer = (char *)NULL ; 
char *name ; 
char *low ; 
char *high ; 
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int i ; 

*numRangeFieldsAllocated = 0 ; 
*numRangeFields = 0 ; 
*rangeFields = (struct RangeStruct *)NULL ; 

if ( !(buffer = UTL_STR_SAVE(RangeVar)) ) 
goto Failure ; 

name = strtok(buffer," "); 
while ( name ) 
{ 
if ( !(low = strtok(NULL," ")) ) 
goto UnableToParse ; 
if ( !(high = strtok(NULL," ")) ) 
goto UnableToParse ; 
if ( *numRangeFields >= *numRangeFieldsAllocated ) 
{ 
if (!*rangeFields) 
{ 
if (!(*rangeFields = (struct RangeStruct 
*)UTL_MEM_CALLOC( 
ALLOCATE_INCREMENT, 
sizeof(struct RangeStruct)))) 
goto Failure ; 
else 
*numRangeFieldsAllocated = 
ALLOCATE_INCREMENT ; 
} 
else 
{ 
if (!( *rangeFields = (struct RangeStruct 
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*)UTL_MEM_RECALLOC( 
RangeFields, 
(*numRangeFieldsAllocated *sizeof(struct RangeStruct)), 
((*numRangeFieldsAllocated + ALLOCATE_INCREMENT) * 
sizeof(struct RangeStruct)) )) ) 
goto Failure ; 
else 
*numRangeFieldsAllocated += 
ALLOCATE_INCREMENT ; 
} 
} 
RangeFields[*numRangeFields].RangeFieldName = 
UTL_STR_SAVE(name); 
RangeFields[*numRangeFields].lowValue = atof(low); 
RangeFields[*numRangeFields].highValue = atof(high); 
(*numRangeFields)++ ; 
name = strtok(NULL," "); 
} 

if (DebugLevel) 
{ 
for ( i = 0 ; i < *numRangeFields ; i++ ) 
{ 
fprintf(stderr,"\n%s %f %f", 
RangeFields[i].RangeFieldName, 
RangeFields[i].lowValue, 
RangeFields[i].highValue); 
} 
} 
stat = 1 ; 
goto Cleanup ; 

UnableToParse: 
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fprintf(stderr,"Unable to parse -rangevar %s\n",RangeVar); 
stat = 0 ; 
goto Cleanup ; 
Failure : 
stat = 0 ; 
goto Cleanup ; 
Cleanup : 
if (buffer) 
UTL_MEM_FREE(buffer); 
return stat ; 
} 
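
The `-rangevar` string is a flat sequence of `name low high` triples separated by spaces. A compact sketch of that parsing step, using fixed-size storage instead of the growing array in the listing, is given below. The type `range_field`, the function `parse_range_fields` and the capacity constants are assumptions made for illustration.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_RANGE_FIELDS 16      /* illustrative fixed capacity */

struct range_field {             /* hypothetical mirror of RangeStruct */
    char  name[32];
    float low;
    float high;
};

/* Parse "name low high name low high ..." into out[].
   Returns the number of fields parsed, or -1 on a malformed string. */
static int parse_range_fields(const char *spec, struct range_field *out, int max)
{
    char buf[256];
    char *name, *low, *high;
    int n = 0;

    if (strlen(spec) >= sizeof buf)
        return -1;
    strcpy(buf, spec);                       /* strtok modifies its input */

    for (name = strtok(buf, " "); name; name = strtok(NULL, " ")) {
        low  = strtok(NULL, " ");
        high = strtok(NULL, " ");
        if (!low || !high || n >= max || strlen(name) >= sizeof out[n].name)
            return -1;
        strcpy(out[n].name, name);
        out[n].low  = (float)atof(low);
        out[n].high = (float)atof(high);
        n++;
    }
    return n;
}
```

As in the listing, a name without both a low and a high value is treated as a parse failure.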

/* 
** 
** Abstract : Function parses one of field string for ADS design programs. 
** It takes a string of the form 
** "supplier Aldrich,Sigma,Fluka,SALOR taste SWEET,Salty" 
** and fills in the global array OneOfValues. 
** 
** Usage : 
** 
** Returns : 1 on success, 0 for failure. 
** 
** Algorithms : None. 
** 
** Revision History : 
** 
**-E: 
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*/ 

static int 
ParseOneOfVar(oneOfVar,numOneOfFieldsAllocated,numOneOfFields,oneOfValues) 
char *oneOfVar ; 
int *numOneOfFieldsAllocated ; 
int *numOneOfFields ; 
struct OneOfStruct **oneOfValues; 
{ 
static int stat = 0 ; 
char *buffer = (char *)NULL ; 
char *name ; 
char *choices ; 
char *choice ; 
int i ; 
int j ; 
char *cp ; 
char *end ; 

*numOneOfFieldsAllocated = 0 ; 
*numOneOfFields = 0 ; 
(*oneOfValues) = (struct OneOfStruct *)NULL ; 
if ( !(buffer = UTL_STR_SAVE(OneOfVar)) ) 
goto Failure ; 

/* 
** Start off by reading the field name. 
*/ 
name = strtok(buffer," "); 
while ( name ) 
{ 
if ( *numOneOfFields >= *numOneOfFieldsAllocated ) 
{ 
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if ( !(*oneOfValues) ) 
{ 
if (!(*oneOfValues = (struct OneOfStruct 
*)UTL_MEM_CALLOC( 
ALLOCATE_INCREMENT, 
sizeof(struct OneOfStruct)))) 
goto Failure ; 
else 
*numOneOfFieldsAllocated = 
ALLOCATE_INCREMENT ; 
} 
else 
{ 
if (!( *oneOfValues = (struct OneOfStruct 
*)UTL_MEM_RECALLOC( 
*oneOfValues, 
(*numOneOfFieldsAllocated *sizeof(struct OneOfStruct)), 
((*numOneOfFieldsAllocated + ALLOCATE_INCREMENT) * 
sizeof(struct OneOfStruct)) )) ) 
goto Failure ; 
else 
*numOneOfFieldsAllocated += 
ALLOCATE_INCREMENT ; 
} 
} 
(*oneOfValues)[*numOneOfFields].OneOfFieldName = 
UTL_STR_SAVE(name); 
(*oneOfValues)[*numOneOfFields].numValues = 0 ; 
(*oneOfValues)[*numOneOfFields].numValuesAlloc = 
ALLOCATE_INCREMENT ; 
if ( !((*oneOfValues)[*numOneOfFields].values = (char **) 
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UTL_MEM_CALLOC(ALLOCATE_INCREMENT, 
sizeof(char *)) ) ) 
goto Failure ; 

/* 
** Now look at the choices this field could have. 
*/ 
choices = strtok(NULL," "); 
if ( !choices ) 
goto UnableToParse ; 
choice = strtok(choices,","); 
while ( choice ) 
{ 
if ( (*oneOfValues)[*numOneOfFields].numValues >= 
(*oneOfValues)[*numOneOfFields].numValuesAlloc ) 
{ 
if ( !((*oneOfValues)[*numOneOfFields].values = (char **) 
UTL_MEM_RECALLOC((*oneOfValues)[*numOneOfFields].values, 
( 
(*oneOfValues)[*numOneOfFields].numValuesAlloc * 
sizeof(char *)), 
( ((*oneOfValues)[*numOneOfFields].numValuesAlloc + 
ALLOCATE_INCREMENT ) 
*sizeof(char *)) ) )) 
goto Failure ; 
(*oneOfValues)[*numOneOfFields].numValuesAlloc += 
ALLOCATE_INCREMENT; 
} 
(*oneOfValues)[*numOneOfFields].values[(*oneOfValues)[*numOneOfFields].numValues] 
= UTL_STR_SAVE(choice); 
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(*oneOfValues)[*numOneOfFields].numValues++ ; 
end = choice + strlen(choice) + 1 ; 
choice = strtok(NULL,","); 
} 
(*numOneOfFields)++; 
name = strtok(end," "); 
} 

if (DebugLevel) 
{ 
for ( i = 0 ; i < *numOneOfFields ; i++ ) 
{ 
fprintf(stderr,"\n%s ", (*oneOfValues)[i].OneOfFieldName) ; 
for ( j = 0 ; j < (*oneOfValues)[i].numValues ; j++ ) 
fprintf(stderr,"\n %s",(*oneOfValues)[i].values[j]); 
} 
fprintf(stderr,"\n"); 
} 
stat = 1 ; 
goto Cleanup ; 

UnableToParse: 
fprintf(stderr,"Unable to parse -oneof %s\n",OneOfVar); 
stat = 0 ; 
goto Cleanup ; 
Failure : 
stat = 0 ; 
goto Cleanup ; 
Cleanup : 
if (buffer) 
UTL_MEM_FREE(buffer); 
return stat ; 
} 
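
The routine above splits each choice list with a second `strtok` pass and then restarts the outer scan from `end`, because `strtok` keeps only one hidden scan position. When only a membership test is needed, the comma-separated list can be scanned without `strtok` at all, as in the sketch below. The function `in_choice_list` is hypothetical, and it uses exact, case-sensitive matching, whereas the listing later compares with a case-insensitive prefix test.

```c
#include <string.h>

/* Return 1 if value appears in the comma-separated list, else 0.
   Matching is exact and case-sensitive in this sketch. */
static int in_choice_list(const char *list, const char *value)
{
    size_t vlen = strlen(value);

    while (*list) {
        size_t len = strcspn(list, ",");     /* length of the current item */
        if (len == vlen && strncmp(list, value, len) == 0)
            return 1;
        list += len;
        if (*list == ',')
            list++;                          /* step over the separator */
    }
    return 0;
}
```

Because `strcspn` does not modify its input, the list can stay in read-only storage and no state is shared between scans.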




/* 
**+E: 
** 
** Abstract : Function parses a line from the input file and extracts 
** out any rangevar or oneof fields. 
** 
** Usage : 
** 
** Returns : Always returns 1 ; 
** 
** Algorithms : None. 
** 
** Revision History : 
** 
**-E: 
*/ 

int ReadLineAttributes(line,numRangeFields,rangeValues,rangeFields,numOneOfFields, 
oneOfValues,oneOfFields) 
char *line ; 
int numRangeFields ; 
float **rangeValues; 
struct RangeStruct *rangeFields; 
int numOneOfFields ; 
int **oneOfValues; 
struct OneOfStruct *oneOfFields; 
{ 
int i ; 
int j ; 
char *cp ; 

/* 
** Now read in the scalar selection fields if any. 
*/ 
if ( numRangeFields ) 
{ 
if ( !(*rangeValues = (float *)UTL_MEM_CALLOC(numRangeFields, 
sizeof(float)) ) ) 
return 0 ; 
} 
if ( numOneOfFields ) 
{ 
if ( !(*oneOfValues = (int *)UTL_MEM_CALLOC(numOneOfFields, 
sizeof(int)) ) ) 
return 0 ; 
} 

for ( i = 0 ; i < numRangeFields ; i++ ) 
{ 
if ( ( cp = strstr(line,rangeFields[i].RangeFieldName ) ) ) 
{ 
/* 
** Move past the logp= to get the value of this field, if the value is 
** a ';' then it is a missing value. 
*/ 
cp = cp + strlen(rangeFields[i].RangeFieldName) + 1 ; 
if ( *cp == ';' ) 
(*rangeValues)[i] = MISSING_FLOAT_VALUE ; 
else 
(*rangeValues)[i] = atof(cp); 
} 
else 
{ 
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(*rangeValues)[i] = MISSING_FLOAT_VALUE ; 

} 

} 

/* 
** Parse the -oneof field, we are looking for something looking like 
** "supplier=Aldrich" 
*/ 
for ( i = 0 ; i < numOneOfFields ; i++ ) 
{ 
if ( ( cp = strstr(line,oneOfFields[i].OneOfFieldName ) ) ) 
{ 
cp = cp + strlen(oneOfFields[i].OneOfFieldName) + 1 ; 
if ( *cp == ';' ) 
(*oneOfValues)[i] = MISSING_INT_VALUE ; 
else 
{ 
for ( j = 0 ; j < OneOfValues[i].numValues ; j++ ) 
{ 
if ( UTL_STR_NCMP_NOCASE(cp, 
oneOfFields[i].values[j], 
strlen(oneOfFields[i].values[j])) == 0) 
{ 
(*oneOfValues)[i] = j ; 
break; 
} 
} 
if ( j == oneOfFields[i].numValues ) 
(*oneOfValues)[i] = NOT_A_MATCH_VALUE ; 
} 
} 
else 
(*oneOfValues)[i] = MISSING_INT_VALUE ; 
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} 

/* 
**+E: 
** 
** Abstract : Function checks to see if the given product passes the 
** user supplied filters. 
** 
** Usage : 
** 
** Returns : 1 if the product is not within range, 0 otherwise. 
** 
** Algorithms : None. 
** 
** Revision History : 
** 
**-E: 
*/ 

static int 
NotWithinScalarRange(firstIndex,secondIndex,numRangeFields,rangeValues_Y01, 
rangeValues_Y02,rangeFields,numOneOfFields,oneOfValues_Y01,oneOfValues_Y02,oneOfValues) 
int firstIndex ; /* Index into Y_01 data */ 
int secondIndex ; /* Index into Y_02 data */ 
int numRangeFields ; 
float **rangeValues_Y01 ; 
float **rangeValues_Y02 ; 
struct RangeStruct *rangeFields; 
int numOneOfFields ; 
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int **oneOfValues_Y01 ; 
int **oneOfValues_Y02 ; 
struct OneOfStruct *oneOfValues ; 
{ 
int i ; 
float total ; 

/* 
** First check the range values. 
*/ 
for ( i = 0 ; i < numRangeFields ; i++ ) 
{ 
/* 
** If one of the regions has a missing value, then we do not filter this 
** product. 
*/ 
if ((( rangeValues_Y01[firstIndex][i] - MISSING_FLOAT_VALUE ) 
== SMALL_FLOAT) || 
(( rangeValues_Y02[secondIndex][i] - MISSING_FLOAT_VALUE ) 
== SMALL_FLOAT ) ) 
return 0 ; 
total = rangeValues_Y01[firstIndex][i] + rangeValues_Y02[secondIndex][i]; 
if ((total > rangeFields[i].highValue ) || 
(total < rangeFields[i].lowValue ) ) 
{ 
return 1 ; 
} 
} 

for ( i = 0 ; i < numOneOfFields ; i++ ) 
{ 
/* 
** If the value is missing then we don't mess with this guy. 
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*/ 
if ( ( oneOfValues_Y01[firstIndex][i] == MISSING_INT_VALUE ) || 
( oneOfValues_Y02[secondIndex][i] == MISSING_INT_VALUE ) ) 
return 0; 

/* 
** If any of the regions in the product does not match the selection 
** criteria, then the product is rejected. 
*/ 
if ( ( oneOfValues_Y01[firstIndex][i] == NOT_A_MATCH_VALUE ) || 
(oneOfValues_Y02[secondIndex][i] == NOT_A_MATCH_VALUE ) ) 
return 1 ; 
} 
return 0 ; 

} 
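
The scalar filter adds the property values of the two reagents and rejects the product when the sum falls outside the requested range, unless either value is missing. As printed, the listing tests for a missing value by comparing a difference with `SMALL_FLOAT` using `==`, which would rarely be true; the sketch below assumes the intent was a tolerant equality test against the sentinel. The names `MISSING_VALUE`, `is_missing` and `sum_out_of_range` are hypothetical.

```c
#define MISSING_VALUE (-100000000.0f)   /* same sentinel as MISSING_FLOAT_VALUE */

/* Tolerant test for the "missing" sentinel. */
static int is_missing(float v)
{
    double d = (double)v - (double)MISSING_VALUE;
    if (d < 0.0)
        d = -d;
    return d < 1.0;
}

/* Return 1 when the summed property of the two reagents falls outside
   [low, high]; return 0 when it is inside or when either value is missing. */
static int sum_out_of_range(float v1, float v2, float low, float high)
{
    float total;

    if (is_missing(v1) || is_missing(v2))
        return 0;                        /* missing data never filters */
    total = v1 + v2;
    return total > high || total < low;
}
```

With the example range `MW 200 500`, reagents of weight 150 and 200 pass (sum 350), while 300 and 250 are filtered out (sum 550).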

static int ParseArguments( argc, argv ) 
/* 
* This function parses the command line arguments. 
* 
* Returns: 1 on a successful command line parse, 0 otherwise. 
* 
* Warnings: 
* 
* Errors: 
* 
* See Also: 
* 
*/ 
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int argc; 
char **argv; 
{ 
int nargs, 
noptions = sizeof( Options )/sizeof(Options[0]); 

OutputFile = stdout; 
nargs = UTL_PARSE_OPT( argc, argv, noptions, Options ); 
if( !nargs ) goto SyntaxError; 

if (WhatFirst) 
{ if (strstr(WhatFirst,"R1")) WhatFirst[0]='1'; 
if (strstr(WhatFirst,"R2")) WhatFirst[0]='2'; 
} else{ 
WhatFirst=UTL_MEM_ALLOC(2); WhatFirst[0]='0'; } 

if (RangeVar && ! 

15 ParseRangeVar(RangeVar,&NumRangeFiddsAUocated,&NumRa^ 
goto SyntaxError ; 
if ( OneOfVar && 
!ParseOneOfVar(&OneOfVar,&NumOnebfFieW^^ 
ues)) 

20 goto SyntaxError ; 

return 1; 

SyntaxError: 
return 0; 

} 



static int OpenOutputFile() 
/*+I 
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* Returns: 1 on success, else 0 
*/ 
{ 
char *msg; 
FILE *fp; 

OutputFile = stdout; 
if( OutputFileName ) 
{ 
/* 
** We need to create output files under the ownership of the REAL user not the 
** EFFECTIVE user. This only applies if setuid options are activated. 
*/ 
{ 
struct stat statBuff ; 
int uid ; 
int euid ; 

uid = getuid() ; 
euid = geteuid(); 
stat(OutputFileName, &statBuff); 
/* 
** There are two cases 
** (1) the file to output to exists 
** Use the ownership of the current owner of the file or if you can't do that 
** do not do anything. 
** (2) The file is being created. 
** use the ownership of the REAL user. 
*/ 
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if ( access(OutputFileName, F_OK) == 0 ) 
{ /* If the file exists and the real user is the owner of the file */ 
if ( statBuff.st_uid == uid ) 
seteuid(uid); 
} 
else 
{ /* Create the file as the REAL user */ 
seteuid(uid); 
} 
OutputFile = fopen( OutputFileName, (AppendToOutputFile?"a":"wb")); 
if( !OutputFile ) { 
fprintf(stderr,"Error: Failed to open output file \"%s\"\n", 
OutputFileName ); 
goto ErrorReturn; 
} 
} 
return 1; 

ErrorReturn: 
return 0; 

} 



static void CloseOutputFile()
/*+!
 * This function closes the output file. It is included just for cleanliness.
 */
{
    fclose( OutputFile );
}



int main( argc, argv )
/*+E
 *
 */
int argc;
char **argv;
{
    long startTime,
         totalTime,
         finishTime;
    int numFiltered = 0 ;

    /*
    *** Establish handler for a user interrupt.
    */
    signal( SIGINT, UserHitControlC);
#ifdef SIGHUP
    signal( SIGHUP, UserHitControlC);
#endif

    if( !ParseArguments( argc, argv ) )
        goto SyntaxError;

    if( !OpenOutputFile() ) goto FailureExit;

    /* if (!RestartState()) goto FailureExit; */

    time( &startTime );
    Visual((stderr,"Begin reading files: %s",ctime(&startTime)));

    /* Let's actually do something now */
    if (!ReadEverything()) goto FailureExit;

    time( &finishTime ) ;
    Visual((stderr,"Begin filtering: %s",ctime(&finishTime)));
#if 0
    DumpBitSet(Good_Products,nY_01,nY_02);
    DumpBitSet(Dead_Products,nY_01,nY_02);
#endif

    if (!FilterProducts(&numFiltered))
        goto FailureExit;

#if 0
    DumpBitSet(Dead_Products,nY_01,nY_02);
#endif

    time( &finishTime );
    Visual((stderr, "Filtered out %d out of %d possible products\n",numFiltered, nY_02
            *nY_01));
    Visual((stderr,"Begin selection: %s",ctime(&finishTime)));

    if (!UserAborted && !SelectEverything()) goto FailureExit;
    CloseOutputFile();
    time( &finishTime );

    totalTime = finishTime - startTime;
    if( !totalTime ) totalTime = 1;

    Visual((stderr, "Created %d Selections in ", nProcessed ));
    Visual((stderr,"%d Hours, %d min, %d secs\n",
            totalTime/(60*60),
            (totalTime%(60*60))/60,
            (totalTime%60)));
    Visual((stderr,"Each comparison required %.8f seconds to calculate\n",
            (totalTime/((double)(nProcessed?nProcessed:1)))));

    Visual((stderr,"End Quick Select Computation: %s",ctime(&finishTime)));

    UserAborted ? exit(ErrorExit) : exit(GoodExit);

SyntaxError:
    exit(1);

FailureExit:
    exit(ErrorExit);
}

int ReadEverything()
{
    char *hold;
    int i;

    /* because failure here means end program run, no effort to clean up
       memory on error is included. */

    if (!PrefixForFiles ) return 0;
    if (!WarmUp()) return 0;

    if (!(hold = UTL_STR_CONCATENATE(PrefixForFiles,".X1.pro"))) return 0;
    if (!(InputSourceFile = fopen(hold,"r"))) return 0;
    if (!(nY_01 = CountLines())) return 0;

    if (!(Y_01 = (int **) UTL_MEM_ALLOC(sizeof(int *) * nY_01))) return 0;
    if (!(cY_01 = (int *) UTL_MEM_ALLOC(sizeof(int ) * nY_01))) return 0;
    if (!(iY_01 = (int *) UTL_MEM_ALLOC(sizeof(int ) * nY_01))) return 0;

    if ( NumRangeFields )
    {
        if (!(RangeValues_Y01 = (float **) UTL_MEM_ALLOC(sizeof(float *) * nY_01)))
            return 0;
    }

    if ( NumOneOfFields )
    {
        if (!(OneOfValues_Y01 = (int **) UTL_MEM_ALLOC(sizeof(int *) * nY_01)))
            return 0;
    }

    for (i=0;i<nY_01;i++)
    {
        if (!GetNextLine(cY_01+i,Y_01+i, RangeValues_Y01 + i, OneOfValues_Y01 + i ))
            return 0;
    }

    fclose(InputSourceFile); UTL_MEM_FREE(hold);

    if (!(hold = UTL_STR_CONCATENATE(PrefixForFiles,".X2.pro"))) return 0;
    if (!(InputSourceFile = fopen(hold,"r"))) return 0;
    if (!(nY_02 = CountLines())) return 0;

    if (!(Y_02 = (int **) UTL_MEM_ALLOC(sizeof(int *) * nY_02))) return 0;
    if (!(cY_02 = (int *) UTL_MEM_ALLOC(sizeof(int ) * nY_02))) return 0;
    if (!(iY_02 = (int *) UTL_MEM_ALLOC(sizeof(int ) * nY_02))) return 0;

    if ( NumRangeFields )
    {
        if (!(RangeValues_Y02 = (float **) UTL_MEM_ALLOC(sizeof(float *) * nY_02)))
            return 0;
    }

    if ( NumOneOfFields )
    {
        if (!(OneOfValues_Y02 = (int **) UTL_MEM_ALLOC(sizeof(int *) * nY_02)))
            return 0;
    }

    for (i=0;i<nY_02;i++)
    {
        if (!GetNextLine(cY_02+i,Y_02+i,RangeValues_Y02 + i, OneOfValues_Y02 + i ))
            return 0;
    }

    fclose(InputSourceFile); UTL_MEM_FREE(hold);

    if (!Good_1) /* not reloaded */
    {   i = (nY_01+31)/32 * 4;
        if (!(Good_1 = (int *) UTL_MEM_ALLOC(i))) return 0; memset( Good_1,0,i);
        if (!(Good_2 = (int *) UTL_MEM_ALLOC(i))) return 0; memset( Good_2,0,i);

        i = (nY_02+31)/32 * 4;
        if (!(Dead_1 = (int *) UTL_MEM_ALLOC(i))) return 0; memset( Dead_1,0,i);
        if (!(Dead_2 = (int *) UTL_MEM_ALLOC(i))) return 0; memset( Dead_2,0,i);

        i = (nY_01*nY_02+31)/32 * 4;
        if (!(Good_Products = (int *) UTL_MEM_ALLOC(i))) return 0;
        memset( Good_Products,0,i);
        if (!(Dead_Products = (int *) UTL_MEM_ALLOC(i))) return 0;
        memset( Dead_Products,0,i);

        SomeLeft = nY_01 * nY_02;
    }

    return 1;
}

int WarmUp()
{
    int i;

    for (i=0;i<256;i++) nbits[i] = (i&1) + (i&2)/2 + (i&4)/4 + (i&8)/8 +
        (i&16)/16 + (i&32)/32 + (i&64)/64 + (i&128)/128 ;
    for (i=0;i<8;i++) setbits[i] = ( 1 << i) & 255;

    return 1;
}
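
WarmUp() precomputes a 256-entry table giving the number of set bits in each byte value, so that later fingerprint comparisons cost one table lookup per byte. The following is a minimal sketch of that technique in ANSI-prototype C; the names bits_in_byte, init_bit_table and count_bits are illustrative assumptions and do not appear in the listing.

```c
#include <assert.h>

/* Hypothetical names: bits_in_byte, init_bit_table and count_bits are
   illustrative only and are not part of the listing. */
static int bits_in_byte[256];

/* Fill the table once: entry i holds the number of 1 bits in byte value i. */
static void init_bit_table(void)
{
    for (int i = 0; i < 256; i++) {
        int n = 0;
        for (int b = 0; b < 8; b++)
            n += (i >> b) & 1;
        bits_in_byte[i] = n;
    }
}

/* Count the set bits in a buffer by summing one table lookup per byte. */
static int count_bits(const unsigned char *buf, int nbytes)
{
    assert(buf != 0 && nbytes >= 0);
    int total = 0;
    for (int i = 0; i < nbytes; i++)
        total += bits_in_byte[buf[i]];
    return total;
}
```

The table must be initialized before the first call to count_bits, which mirrors the way ReadEverything() calls WarmUp() before any fingerprint is read.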

int CountLines()
{
    int i;
    char *foo;

    i=0;
    while ( -1 != UTL_SCAN_GETS( InputSourceFile, "\n", "#", &foo)) i++ ;

    rewind(InputSourceFile);
    return i;
}



static int FilterProducts(numFiltered)
int *numFiltered ;
{
    int numProducts ;
    int i ;
    int index1 ;
    int index2 ;

    *numFiltered = 0 ;
    numProducts = nY_02 * nY_01 ;

    for ( i = 0 ; i < numProducts ; i++ )
    {
        index1 = i / nY_02 ; /* Y_01 index */
        index2 = i % nY_02 ; /* Y_02 index */

        if ( NotWithinScalarRange(index1,
                                  index2,
                                  NumRangeFields ,
                                  RangeValues_Y01 ,
                                  RangeValues_Y02 ,
                                  RangeFields,
                                  NumOneOfFields ,
                                  OneOfValues_Y01 ,
                                  OneOfValues_Y02 ,
                                  OneOfValues ))
        {
            FlagProduct(Dead_Products,0,0,i);
            SomeLeft--;
            *numFiltered += 1 ;
            if (DebugLevel)
                fprintf(stderr, "Filtered %d %d %d\n",i+1, index1+1, index2+1);
        }
    }

    return 1 ;
}

/*
**+!
**
** Abstract : Function will read the next line pointed to by global
**            InputSourceFile and parses out finger print and any other
**            scalar attributes we are filtering on.
**
** Usage :
**
** Returns : 1 on success, 0 for failure.
**
** Algorithms : None.
** Revision History :
**
*/
int GetNextLine( pCard, pFP, rangeValues, oneOfValues )
int *pCard;            /*(OUT) returns the cardinality of the finger print */
int **pFP;             /*(OUT) returns the finger print */
float **rangeValues;   /*(OUT) */
int **oneOfValues;     /*(OUT) */
{
    char *line;
    int words;
    int i ;
    int j ;
    char *cp ;

    if (-1 == UTL_SCAN_GETS( InputSourceFile, "\n", "#", &line)) return 0;

    ReadLineAttributes(line,
                       NumRangeFields,
                       rangeValues,
                       RangeFields,
                       NumOneOfFields,
                       oneOfValues,
                       OneOfValues) ;

    line = strstr(line,"fpcard=")+strlen("fpcard=");
    if (!UTL_STR_EXTRACT_INT(line, pCard)) return 0;
    line = strstr(line,"fp=")+strlen("fp=");
    UTL_SCAN_TOKENIZE(line,";",'\\');
    UTL_SCAN_TOKENIZE(line,">",'\\');
    words = strlen(line) / 8; /* must have 32 bit int multiple */
    if (!WordsPerFingerprint)
    {   BytesPerFingerPrint = words*4;
        query = (int *) UTL_MEM_ALLOC( BytesPerFingerPrint);
        WordsPerFingerprint = words;}
    if ( words != WordsPerFingerprint) return 0;
    *pFP = (int *) UTL_MEM_ALLOC(words * sizeof(int));
    for (words=0;words< WordsPerFingerprint; words++)
    {
        memcpy(next8,line,8) ;
        line +=8;
        sscanf(next8,"%8x", *pFP+words);
    }

    return 1;
}

int IntersectQuery( pIntr, pFP)
int *pIntr, **pFP;
{
    unsigned char *ptr ,*qtr;
    int i, count;

    ptr = (unsigned char *) *pFP;
    qtr = (unsigned char *) query;
    for (count=0, i=0; i<WordsPerFingerprint*4;i++)
        count += nbits[ *ptr++ & *qtr++];

    *pIntr = count;
    return 1;
}

int SelectEverything()
{
    int cqt, q_lo, q_hi, i, j, carhold, inthold, onion, intsc;
    double max;

    while (nProcessed < NoMorehitsPlease && SomeLeft )
    {
        /*
        ** What we would like to do is first select any selections that were found
        ** in a previous run.
        */
        if ( !InputSource || !( c_query = SelectFromInputFile(query)) )
        {
            if (!(c_query = SelectIt(query) ))
                return 0;
        }

        nProcessed++;
        SomeLeft--;

        /* then zap its neighbors and continue! */
        for (i=0;i<nY_01;i++)
            if (!IntersectQuery( iY_01+i, Y_01+i )) return 0;
        for (i=0;i<nY_02;i++)
            if (!IntersectQuery( iY_02+i,Y_02+i )) return 0;

        cqt = floor( (double) c_query / Tanimoto);
        q_lo = floor( (double) c_query * Tanimoto - (double) BitsInAbsentia);
        q_hi = ceil( (double) ( c_query + BitsInAbsentia) / Tanimoto);

        if (DebugLevel)
            DumpValues(nY_01,nY_02);

        for (i=0;i<nY_01;i++)
        {
            if (cY_01[i] > cqt)
                { continue;}

            carhold = q_lo - cY_01[i];
            inthold = q_lo - iY_01[i];
            for (j=0;j<nY_02;j++)
            {
                if (UserAborted) return 1;
                if (cY_02[j] > cqt) { continue;}
                if (cY_02[j] < carhold) { continue; }
                if (inthold > iY_02[j]) { continue; }
                /*
                ** Do not need to look at it, if it has already been used, eliminated.
                */
                if (TestBit(Dead_Products,i*nY_02+j)) continue;

                ActuallyCompute( i, j, &onion, &intsc, &max);
                if (max >= Tanimoto)
                {
                    FlagProduct(Dead_Products, i, j, 0);
                    SomeLeft--;
                    if (DebugLevel)
                    {
                        fprintf(stderr,"\nZapping %d %d",i+1,j+1);
                        DumpBitSet(Dead_Products,nY_01,nY_02);
                    }
                }
            } /* Y_02 loop */
        } /* Y_01 loop */
    } /* while still stuff left */
    return 1;
}

int TestBit(bitset, bit)
int *bitset, bit;
{
    int what, this;
    unsigned char *bytes;

    bytes = (unsigned char *) bitset;

    what = bit % 8;
    this = bit / 8;

    return (bytes[this] & setbits[what] );
}
int FlagProduct(TheProducts, index1, index2, this)
int *TheProducts;
int index1,index2, this;
{
    int what;
    unsigned char *Products;

    /* if (DebugLevel)
       printf("%d %d, %d, %x\n",index1,index2,this,TheProducts);*/
    Products = (unsigned char *) TheProducts;

    if (!this ) this = index1*nY_02 + index2; /* bit index */
    what = this % 8;
    this /= 8;

    Products[this] |= setbits[what];
    return 1;
}
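
TestBit() and FlagProduct() treat the product matrix as a flat bitset in which product (index1, index2) occupies bit index1*nY_02 + index2. A minimal sketch of that layout follows, assuming 8-bit bytes; bitset_alloc, bitset_set, bitset_test and product_bit are hypothetical helper names used only for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative sketch; bitset_alloc, bitset_set, bitset_test and
   product_bit are hypothetical helpers, not functions from the listing. */

/* Allocate a zeroed bitset with room for nbits bits. */
static unsigned char *bitset_alloc(int nbits)
{
    assert(nbits > 0);
    return calloc(((size_t)nbits + 7) / 8, 1);
}

/* Set one bit: byte bit/8, position bit%8 within that byte. */
static void bitset_set(unsigned char *set, int bit)
{
    set[bit / 8] |= (unsigned char)(1u << (bit % 8));
}

/* Return 1 if the bit is set, 0 otherwise. */
static int bitset_test(const unsigned char *set, int bit)
{
    return (set[bit / 8] >> (bit % 8)) & 1;
}

/* Row-major bit index of product (i1, i2) when there are n2 second reagents. */
static int product_bit(int i1, int i2, int n2)
{
    return i1 * n2 + i2;
}
```

Unlike FlagProduct(), this sketch does not overload a zero bit index to mean "compute the index from the two reagent indices"; the caller always passes the flat index explicitly.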

int DumpBitSet(bitSet,numY01,numY02)
int *bitSet ;
int numY01 ;
int numY02 ;
{
    int i , j ;
    unsigned char *Products = (unsigned char *)bitSet ;
    int pos ;
    int byte ;
    int bit;
    int index1;
    int index2 ;

fprintf(stderr/Vn Y_02 — \n 
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for ( i = 0 ; i < nY_02 ; i++ ) 

fprintf(stderr," %3d 
fprintfi(stderr, "\n \n 



5 



    for ( i = 0 ; i < (numY01 * numY02) ; i++ )
    {
        index1 = i / numY02 ;
        index2 = i % numY02 ;
        byte = i / 8 ;
        bit = i % 8 ;
        if ((index2 == 0) )
            fprintf(stderr,"\n%3d |",index1+1);
        fprintf(stderr," %3d ",(Products[byte] & setbits[bit])?1:0 );
    }

    fprintf(stderr,"\n \n");
}



int DumpValues(numY01,numY02)
int numY01 ;
int numY02 ;
{
    int i, j;
    int pos ;
    int byte ;
    int bit ;
    int index1 ;
    int index2 ;
    int onion ;
    int intsc ;
    double max ;

lprintf(stderr,"\n Y_02 ^Vn 

"); 

5 for ( i » 0 ; i < nY_02 ; i++ ) 
fprintf(stderr," %56 
Qnintf(stden-,"\n- \n 

■); 

    for ( i = 0 ; i < (numY01 * numY02) ; i++ )
    {
        index1 = i / numY02 ;
        index2 = i % numY02 ;

        ActuallyCompute( index1, index2, &onion, &intsc, &max);
        if ((index2 == 0) )
            fprintf(stderr,"\n%5d |",index1+1);
        fprintf(stderr," %0.3f ",max);
    }

    fprintf(stderr,"\n \n");
}

int FlagReagent(TheReagent, size, index)
int *TheReagent;
int size, index;
{
    int what, this;
    unsigned char *Reagent;

    Reagent = (unsigned char *) TheReagent;
    what = index % 8;
    this = index / 8;
    Reagent[this] |= setbits[what];
    return 1;
}

int SelectFromInputFile(query)
int *query ;
{
    static int firstTime = 1 ;
    static FILE *fp = (FILE *)NULL ;
    unsigned char *p, *q ;
    int index1 ;
    int index2 ;
    int index ;
    char *line ;
    char *cp ;
    unsigned char *queryPtr ;

    if (firstTime)
    {
        if ( !(fp = fopen(InputSource,"r")))
            goto UnableToOpenFile ;
        firstTime = 0 ;
    }

    if (-1 == UTL_SCAN_GETS( fp, "\n", "#", &line)) return 0;

    if ( !(cp = strtok(line," ")) )
        goto UnableToParseLine ;
    if ( !( cp = strtok(NULL," ")) )
        goto UnableToParseLine ;
    index1 = atoi(cp) - 1 ;
    if ( !(cp = strtok(NULL," ")) )
        goto UnableToParseLine ;
    index2 = atoi(cp) - 1 ;

    if (( index1 < 0 ) || ( index2 < 0 ) )
        goto UnableToParseLine ;

    /*
    ** If we are reading back in a selection that might have already been filtered
    ** out we better adjust our counts.
    */
    if (TestBit(Dead_Products,index1*nY_02+index2))
        SomeLeft++;

    p = (unsigned char *) Y_01[index1];
    q = (unsigned char *) Y_02[index2];
    c_query = 0;
    queryPtr = (unsigned char *)query ;

    for (index =0;index < BytesPerFingerPrint;index++,queryPtr++)
    {
        *queryPtr = *p++ | *q++ ;
        c_query += nbits[*queryPtr & 255];
    }

    OutputThisHit(index1,index2); /* both print it and note it in bitsets */
    return c_query;

UnableToParseLine :
    fprintf(stderr, "Unable to Parse %s\n",line);
    return 0 ;

UnableToOpenFile :
    fprintf(stderr,"Unable to open file %s\n",InputSource);
    return 0 ;
}


/* Here the intent is to select the next compound "intelligently".
   We try to maximize use of one or the other reagent.
*/
int SelectIt(query)
int *query;
{
    int i, j;

    if (What1 < 0) {GrabRandom( &i, &j, query); goto out; }

    switch (WhatFirst[0])
    {
    case '0':
        GrabRandom( &i, &j, query);
        break;

    case '1':
        GrabThis( &i, &j, 1, query);
        break;

    case '2':
        GrabThis( &i, &j, 2, query);
        break;
    }

out:
    OutputThisHit(i, j); /* both print it and note it in bitsets */
    return c_query;
}

int GrabThis( p1, p2, type, fp)
int *p1, *p2, type, *fp;
{
    unsigned char *p, *q, *pro;
    int index;

    switch (type)
    {
    case 1:
        if (!findOne(Dead_Products, What1*nY_02, 1, nY_02) &&
            !findOne(Dead_Products, What2 , nY_02, nY_01) &&
            !GrabRandom( p1, p2, fp) ) return 0;
        break;

    case 2:
        if (!findOne(Dead_Products, What2 , nY_02, nY_01) &&
            !findOne(Dead_Products, What1*nY_02, 1, nY_02) &&
            !GrabRandom( p1, p2, fp) ) return 0;
        break;
    }

    *p1 = What1; *p2 = What2;
    pro = (unsigned char *) fp;
    p = (unsigned char *) Y_01[What1];
    q = (unsigned char *) Y_02[What2];
    c_query = 0;
    for (index =0;index <BytesPerFingerPrint;index++,pro++)
    {   *pro = *p++ | *q++ ;
        c_query += nbits[*pro & 255]; }
    return 1;
}
/* This can be done more efficiently when we KNOW we are walking a vector */
int findOne(bitset,start,incr,max)
int *bitset, start, incr, max;
{
    int i;

    for (i=0;i<max;i++, start += incr )
    {
        if ( TestBit(bitset, start)) continue;
        What1 = start / nY_02;
        What2 = start % nY_02;
        return 1;
    }

    return 0;
}
int GrabRandom( p1, p2, fp)
int *p1, *p2, *fp;
{
    int index, sum;
    int value1, value2;
    unsigned char *p, *q, *pro;

    p = (unsigned char *) Dead_Products;
    index = UTL_MATH_RAND() * SomeLeft + 1;
    value1 = sum = 0;

    while (sum < index)
    {   sum += nbits[ ~(*p++) & 255];
        value1 += 8; }

    p -= 1; sum -= nbits[ ~(*p) & 255 ]; value1 -= 9; value2 = (~(*p) & 255);
    while (sum < index)
    {   value1++;
        if ( value2 & 1) sum++;
        value2 = value2 >> 1; }

    value2 = value1 % nY_02;
    value1 /= nY_02 ;

    *p1 = What1 = value1;
    *p2 = What2 = value2;

    pro = (unsigned char *) fp;
    p = (unsigned char *) Y_01[What1];
    q = (unsigned char *) Y_02[What2];
    c_query = 0;
    for (index =0;index < BytesPerFingerPrint;index++,pro++)
    {   *pro = *p++ | *q++ ;
        c_query += nbits[*pro & 255]; }
    return 1;
}
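
GrabRandom() draws a random rank among the products that are still available and then walks the Dead_Products bitset to find the product holding that rank, skipping whole bytes by their count of clear bits before scanning bit by bit. The sketch below shows the same idea in its simplest form, a linear scan for the k-th clear bit; kth_clear_bit is a hypothetical name, and the byte-skipping optimization of the listing is deliberately left out.

```c
#include <assert.h>

/* Return the index of the k-th (1-based) clear bit among the first nbits
   bits of set, or -1 if fewer than k bits are clear.
   Hypothetical helper, for illustration only. */
static int kth_clear_bit(const unsigned char *set, int nbits, int k)
{
    assert(set != 0 && nbits >= 0 && k >= 1);
    int seen = 0;
    for (int bit = 0; bit < nbits; bit++) {
        if (!((set[bit / 8] >> (bit % 8)) & 1)) {
            if (++seen == k)
                return bit;
        }
    }
    return -1;
}
```

Given the returned flat index, the two reagent indices follow as index / nY_02 and index % nY_02, as in the listing.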

int ActuallyCompute( index1, index2, pUnion, pIntersection, pMaxTan)
int index1, index2, *pUnion, *pIntersection;
double *pMaxTan;
{
    int i, product;
    unsigned char *h1, *h2, *hquery;

    /* if (DebugLevel)
       fprintf( stderr," ActuallyCompute at %d , %d\n", index1, index2);*/

    h1 = (unsigned char *) Y_01[index1];
    h2 = (unsigned char *) Y_02[index2];
    hquery = (unsigned char *) query;
    *pUnion = *pIntersection = 0;

    for ( i=0; i<WordsPerFingerprint*4;i++)
    {
        product = *h1++ | *h2++ ;
        *pUnion += nbits[ product | *hquery];
        *pIntersection += nbits[ product & *hquery++];
    }

    /* if (DebugLevel > 9) fprintf(stderr,"%d / %d %6.3f\n",
       *pIntersection, *pUnion,
       (double) *pIntersection / *pUnion); */
    return (*pMaxTan = (double) (*pIntersection + BitsInAbsentia) / (double) *pUnion);
}
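
ActuallyCompute() forms a product fingerprint by OR-ing the two reagent fingerprints and compares it with the query by the ratio of shared bits to the bits in their union. A minimal, self-contained sketch of that Tanimoto calculation follows. The names popcount8 and product_tanimoto are assumptions made for illustration, and the BitsInAbsentia correction that the listing adds to the intersection is omitted.

```c
#include <assert.h>

/* Count the set bits in the low 8 bits of v. Hypothetical helper. */
static int popcount8(unsigned v)
{
    int n = 0;
    for (v &= 0xFFu; v; v >>= 1)
        n += (int)(v & 1u);
    return n;
}

/* Tanimoto coefficient between a query fingerprint and the product
   fingerprint formed by OR-ing two reagent fingerprints byte by byte.
   Returns 0.0 when the union is empty. */
static double product_tanimoto(const unsigned char *r1, const unsigned char *r2,
                               const unsigned char *query, int nbytes)
{
    assert(r1 && r2 && query && nbytes >= 0);
    int uni = 0, inter = 0;
    for (int i = 0; i < nbytes; i++) {
        unsigned product = (unsigned)(r1[i] | r2[i]);
        uni   += popcount8(product | query[i]);
        inter += popcount8(product & query[i]);
    }
    return uni ? (double)inter / (double)uni : 0.0;
}
```

The explicit guard for an empty union is a choice of this sketch; the listing divides without such a check.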

int OutputThisHit( index1, index2)
int index1, index2;
{
    int which;

    which = index1*nY_02+index2;
    fprintf(OutputFile,"%s%d %d %d\n", PrefixForFiles, which+1,
            index1+1,index2+1);

    FlagProduct(Good_Products,0,0, which);
    FlagProduct(Dead_Products,0,0, which); /* can only be selected once */

    /* note use of reagents; this is slightly wasteful of time */
    FlagReagent(Good_1, nY_01, index1);
    FlagReagent(Good_2, nY_02, index2);

    return 1;
}
Appendix "J"

/* dbcsln_qstop */

/*+c
 * This program evaluates topomeric shape similarity vs cSLNs
 * based on preprocessing of the substituent reagents. Using this, it
 * selects a diverse set of products while trying to maximize use of
 * some groups. A key assumption: D^2(i,j) = Dr1^2(i,j) + Dr2^2(i,j)
 * i.e. the distance between products from any one reaction is the
 * root mean square distance of their corresponding reactants.
 *
 * To be added: restart capability and reagent blackout.
 *
 * The input files, one per X1, X2, have one line per
 * structure and contain the element "fp=xyz;" where
 * the terminating ; may also be >
 * The hex value of fp is the condensed representation of a CoMFA grid
 * value, 4 bits (one hex char) per grid, with interpretation as in
 * routine WhatsTheDifference().
 *
 * The resultant file contains one line per hit, of the form
 *     Y1 Y2 D D1 D2
 * where Y1 = index of the substituent in X1.piT file
 *       Y2 = index of the substituent in X2.piT file
 *       D = apparent Tanimoto similarity
 *       D1,D2 = R1, R2 distance
 *
 * dbcsln_stop -prefix <name> -distance <real> -prefer <what> -append
 *             -maxhits <int> -output <name> +debug
 *
 * Options:
 *
 * -prefix name - name is the prefix for a set of 2 files
 *                with extensions .X1.piT .X2.piT
 *                ; files have fingerprints
 *                (someday) will reload from prefix.RELOAD if present
 *
 * -distance dmin - dmin is the closest allowed approach
 *                (default is 80)
 *
 * -prefer - one of R1, R2 else random. R1 maximizes use of R1
 * -maxhits max - stop when max hits are found (default infinity)
 *
 * -output filename - specifies the output file for the hit info
 *                by default results are sent to stdout.
 *
 * -append - append results to an existing output file
 *                By default an output file is overwritten.
 *
 * +debug - writes irrelevant info to stderr
 *                This flag forces the display of all
 *                options
 */

/* use 3db
 * dbcc dbcslnquickselect.c -o dbcslnquickselect */
#include <stdio.h>
#include <signal.h>
#include <ctype.h>
#include <unistd.h>
#include <string.h>
#include <sys/stat.h>
#include <math.h>
#include "parseopt.h"

#include "utl_mem.h"
#include "utl_file.h"
#include "utl_math.h"
#include "ct.h"
#include "ct_expr.h"

#include "ct_proto.h"
#include "import_proto.h"

#define GoodExit 0
#define ErrorExit 1
#define Visual(s) { fprintf s; }



static FILE *OutputFile;
static char *OutputFileName;

static char *WhatFirst;
static int What1 = -1;
static int What2;

static char *PrefixForFiles;
static char *InputSource = 0;
static FILE *InputSourceFile;

/* Code presumes that an int is 32 bits, ASCII-ed into %.8x format */
static unsigned char **Y_01; /* fingerprints */
static unsigned char **Y_02; /* " */
static int nY_01;            /* number of structures */
static int nY_02;            /* " */
static double *iY_01;        /* intersection count of fprints */
static double *iY_02;        /* " */

static int *Good_1;
static int *Good_2;
static int *Dead_1;
static int *Dead_2;
static int *Good_Products;
static int *Dead_Products;

static double boundary[16];
static double Dist[16][16];
static int setbits[8];
static int nbits[256];

static double Distance = 80.0 ;
static int AppendToOutputFile = 0;
static int BytesPerFingerPrint[2] ;
static int NoMorehitsPlease = 999999999;
static int DebugLevel;
static int UserAborted;

static int nProcessed = 0;
static int SomeLeft;
static char next8[10] = "01\0";

static struct ParseOptions Options[] = {
/***
*** DO NOT MOVE ENTRIES IN THIS TABLE.
*** END.
***/
    {"prefix", ParseOptString, &PrefixForFiles,
        "Prefix for all input files" },
    {"distance", ParseOptDouble, &Distance,
        "Topomer distance (typically 75 to 100)" },
    {"maxhits", ParseOptInt, &NoMorehitsPlease,
        "Maximum number of hits before stopping" },
    {"input", ParseOptString, &InputSource,
        "File from which queries will be read( default stdin). "},
    {"output", ParseOptString, &OutputFileName,
        "File to which hit info will be written. "},
    {"prefer", ParseOptString, &WhatFirst,
        "One of R1, R2 to maximize use of."},
    {"append", ParseOptNoArg, &AppendToOutputFile,
        "Use -append to append results to an existing file" },
    {"debug", ParseOptBoolean, &DebugLevel,
        "Use +debug to enable debugging messages" },
};

int UBS_OUTPUT_MESSAGE() { return 0; } /* just for compiling OK */
int UIMS2_WRITE_PHOTO() { return 0; }

int lowercase (s) char *s; {while (*s) { if (isupper(*s)) *s = tolower(*s); s++;}}
static void UserHitControlC()
/*+!
 * This function is the signal handler for user initiated program termination.
 * Its only role is to set a flag indicating that the user wishes to abort the program.
 */
{
    UserAborted = 1;
}



static int ParseArguments( argc, argv )
/*+!
 *
 * This function parses the command line arguments.
 *
 * Returns: 1 on a successful command line parse, 0 otherwise.
 *
 * Warnings:
 *
 * Errors:
 *
 * See Also:
 *
 */
int argc;
char **argv;
{
    int nargs,
        noptions = sizeof( Options )/sizeof(Options[0]);

    OutputFile = stdout;

    nargs = UTL_PARSE_OPT( argc, argv, noptions, Options );
    if( !nargs ) goto SyntaxError;

    if (WhatFirst)
    {   if (strstr(WhatFirst,"R1")) WhatFirst[0]='1';
        if (strstr(WhatFirst,"R2")) WhatFirst[0]='2';
    } else {
        WhatFirst=UTL_MEM_ALLOC(2); WhatFirst[0]='0'; }

    return 1;

SyntaxError:
    return 0;
}

static int OpenOutputFile()
/*+!
 * Returns: 1 on success, else 0
 */
{
    char *msg;
    FILE *fp;

    OutputFile = stdout;
    if( OutputFileName )
    {
        /*
        ** We need to create output files under the ownership of the REAL user not the
        ** EFFECTIVE user. This only applies if setuid options are activated.
        */
        {
            struct stat statBuff ;
            int uid ;
            int euid ;

            uid = getuid() ;
            euid = geteuid();
            stat(OutputFileName, &statBuff);

            /*
            ** There are two cases
            ** (1) the file to output to exists
            **     Use the ownership of the current owner of the file or if you cant do that
            **     do not do anything.
            ** (2) The file is being created.
            **     use the ownership of the REAL user.
            */
            if ( access(OutputFileName, F_OK) == 0 )
            {   /* If the file exist and the real user is the owner of the file */
                if ( statBuff.st_uid == uid )
                    seteuid(uid);
            }
            else
            {   /* Create the file as the REAL user */
                seteuid(uid);
            }
        }

        OutputFile = fopen( OutputFileName, (AppendToOutputFile?"a":"wb"));
        if( !OutputFile ) {
            fprintf(stderr,"Error: Failed to open output file \"%s\"\n",
                    OutputFileName );
            goto ErrorReturn;
        }
    }

    return 1;

ErrorReturn:
    return 0;
}



static void CloseOutputFile()
/*+!
 * This function closes the output file. It is included just for cleanliness.
 */
{
    fclose( OutputFile );
}



int main( argc, argv )
/*+E
 *
 */
int argc;
char **argv;
{
    long startTime,
         totalTime,
         finishTime;

    /*
    *** Establish handler for a user interrupt.
    */
    signal( SIGINT, UserHitControlC);
#ifdef SIGHUP
    signal( SIGHUP, UserHitControlC);
#endif

    if( !ParseArguments( argc, argv ) )
        goto SyntaxError;

    if( !OpenOutputFile() ) goto FailureExit;

    /* if (!RestartState()) goto FailureExit; */

    time( &startTime );
    Visual((stderr,"Begin reading files: %s",ctime(&startTime)));

    /* Let's actually do something now */
    if (!ReadEverything()) goto FailureExit;
    time( &finishTime );

    Visual((stderr,"Begin selection: %s",ctime(&finishTime)));
    if (!UserAborted && !SelectEverything()) goto FailureExit;
    CloseOutputFile();
    time( &finishTime );

    totalTime = finishTime - startTime;
    if( !totalTime ) totalTime = 1;

    Visual((stderr, "Created %d Selections in ", nProcessed ));
    Visual((stderr,"%d Hours, %d min, %d secs\n",
            totalTime/(60*60),
            (totalTime%(60*60))/60,
            (totalTime%60)));
    Visual((stderr,"Each comparison required %.8f seconds to calculate\n",
            (totalTime/((double)(nProcessed?nProcessed:1)))));

    Visual((stderr,"End Quick Select Computation: %s",ctime(&finishTime)));

    UserAborted ? exit(ErrorExit) : exit(GoodExit);

SyntaxError:
    exit(1);

FailureExit:
    exit(ErrorExit);
}

int ReadEverything()
{
    char *hold;
    int i;

    /* because failure here means end program run, no effort to clean up
       memory on error is included. */

    if (!PrefixForFiles ) return 0;
    if (!WhatsTheDifference()) return 0;

    if (!(hold = UTL_STR_CONCATENATE(PrefixForFiles,".X1.piT"))) return 0;
    if (!(InputSourceFile = fopen(hold,"r"))) return 0;
    if (!(nY_01 = CountLines())) return 0;

    if (!(Y_01 = (unsigned char **)
          UTL_MEM_ALLOC(sizeof(unsigned char *)*nY_01))) return 0;
    if (!(iY_01 = (double *) UTL_MEM_ALLOC(sizeof(double ) * nY_01))) return 0;
    for (i=0;i<nY_01;i++)
        if (!GetNextLine( Y_01+i, 0 )) return 0;

    fclose(InputSourceFile); UTL_MEM_FREE(hold);

    if (!(hold = UTL_STR_CONCATENATE(PrefixForFiles,".X2.prT"))) return 0;
    if (!(InputSourceFile = fopen(hold,"r"))) return 0;
    if (!(nY_02 = CountLines())) return 0;

    if (!(Y_02 = (unsigned char **)
          UTL_MEM_ALLOC(sizeof(unsigned char *) * nY_02))) return 0;
    if (!(iY_02 = (double *) UTL_MEM_ALLOC(sizeof(double ) * nY_02))) return 0;
    for (i=0;i<nY_02;i++)
        if (!GetNextLine( Y_02+i, 1)) return 0;

    fclose(InputSourceFile); UTL_MEM_FREE(hold);

    if (!Good_1) /* not reloaded */
    {   i = (nY_01+31)/32 * 4;
        if (!(Good_1 = (int *) UTL_MEM_ALLOC(i))) return 0; memset( Good_1,0,i);
        if (!(Good_2 = (int *) UTL_MEM_ALLOC(i))) return 0; memset( Good_2,0,i);

        i = (nY_02+31)/32 * 4;
        if (!(Dead_1 = (int *) UTL_MEM_ALLOC(i))) return 0; memset( Dead_1,0,i);
        if (!(Dead_2 = (int *) UTL_MEM_ALLOC(i))) return 0; memset( Dead_2,0,i);

        i = (nY_01*nY_02+31)/32 * 4;
        if (!(Good_Products = (int *) UTL_MEM_ALLOC(i))) return 0;
        memset( Good_Products,0,i);
        if (!(Dead_Products = (int *) UTL_MEM_ALLOC(i))) return 0;
        memset( Dead_Products,0,i);
        SomeLeft = nY_01 * nY_02;
    }

    return 1;
}

int WhatsTheDifference()
{
    int i, j;
#define pow2(a) ( (a) * (a) )

    /* the assignment of codes is based on the following (from genjils.c):
       static fpcutoffs[16] = {9999., 0., 2., 4., 6., 8., 10., 12.,
                               14., 16., 18., 20., 22., 24., 26., 30. };
    */
    boundary[0] = 9999.; /* missing data ought never to occur. */
    boundary[1] = -0.1 ;
    for (i=2;i<15;i++)
        boundary[i] = 2*i-3;
    boundary[15] = 30.0; /* this is a steep curve with a cutoff at 30! */
    for (i=0;i<16;i++) for (j=0;j<16;j++)
        Dist[i][j] = pow2( boundary[i] - boundary[j]);

    for (i=0;i<256;i++) nbits[i] = (i&1) + (i&2)/2 + (i&4)/4 + (i&8)/8 +
        (i&16)/16 + (i&32)/32 + (i&64)/64 + (i&128)/128 ;
    for (i=0;i<8;i++) setbits[i] = ( 1 << i) & 255;

    Distance *= Distance; /* want to test D^2 directly */
    return 1;
}

int CountLines()
{
    int i;
    char *foo;

    i=0;
    while ( -1 != UTL_SCAN_GETS( InputSourceFile, "\n", "#", &foo)) i++;

    rewind(InputSourceFile);
    return i;
}
int GetNextLine( pFP, index)
unsigned char **pFP;
int index;
{
    char *line;
    int words, hold;

    if (-1 == UTL_SCAN_GETS( InputSourceFile, "\n", "#", &line)) return 0;
    line = strstr(line,"fp=")+strlen("fp=");
    UTL_SCAN_TOKENIZE(line,";",'\\');
    UTL_SCAN_TOKENIZE(line,">",'\\');

    words = strlen(line) / 2; /* must have 8 bit bytes */
    if (!BytesPerFingerPrint[index])
    {   BytesPerFingerPrint[index] = words;
    }

    if ( words != BytesPerFingerPrint[index]) return 0;
    *pFP = (unsigned char *) UTL_MEM_ALLOC(words);
    for (words =0;words < BytesPerFingerPrint[index]; words++)
    {
        memcpy(next8,line,2);
        line +=2;
        sscanf(next8,"%2x", &hold);
        *(*pFP+words) = (unsigned char) hold;
    }

    return 1;
}

int IntersectQuery( pIntr, pFP, query, index)
double *pIntr;
unsigned char **pFP, **query;
int index;
{
    unsigned char *ptr ,*qtr;
    int i;
    double count;

    ptr = (unsigned char *) *pFP;
    qtr = (unsigned char *) *query;
    for (count=0.0, i=0; i<BytesPerFingerPrint[index];i++, ptr++, qtr++)
        count += Dist[ *ptr&0x0F ][ *qtr&0x0F ]
               + Dist[(*ptr&0xF0) >> 4][(*qtr&0xF0) >> 4] ;

    *pIntr = count;
    return 1;
}
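
In this variant each fingerprint byte packs two 4-bit grid codes, and IntersectQuery() sums squared differences between the bin values that those codes stand for. The sketch below illustrates the calculation under the bin assignment given in WhatsTheDifference(); it computes each difference directly instead of consulting a precomputed 16 by 16 table, and the names bin_center, init_bin_centers and squared_distance are hypothetical.

```c
#include <assert.h>

/* Representative value for each 4-bit code. Hypothetical name. */
static double bin_center[16];

/* Bin values as assigned in the listing: code 0 marks missing data,
   code 1 sits just below zero, codes 2..14 are 1, 3, ..., 25, code 15 is 30. */
static void init_bin_centers(void)
{
    bin_center[0] = 9999.0;
    bin_center[1] = -0.1;
    for (int i = 2; i < 15; i++)
        bin_center[i] = 2 * i - 3;
    bin_center[15] = 30.0;
}

/* Squared distance between two packed fingerprints, two grid codes per byte. */
static double squared_distance(const unsigned char *a, const unsigned char *b,
                               int nbytes)
{
    assert(a && b && nbytes >= 0);
    double sum = 0.0;
    for (int i = 0; i < nbytes; i++) {
        double lo = bin_center[a[i] & 0x0F] - bin_center[b[i] & 0x0F];
        double hi = bin_center[a[i] >> 4] - bin_center[b[i] >> 4];
        sum += lo * lo + hi * hi;
    }
    return sum;
}
```

A lookup table such as Dist[16][16] trades 256 stored doubles for the subtraction and multiplication in the inner loop; which is faster depends on the machine.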

int SelectEverything()
{
    int cqt, q_lo, q_hi, i, j, carhold, inthold, onion, intsc;
    double max;

    while (nProcessed < NoMorehitsPlease && SomeLeft )
    {
        if (!SelectIt() ) return 0;

        nProcessed++;
        SomeLeft--;

        /* then zap its neighbors and continue! */
        for (i=0;i<nY_01;i++)
            if (!IntersectQuery( iY_01+i,Y_01+i, Y_01 + What1,0 )) return 0;
        for (i=0;i<nY_02;i++)
            if (!IntersectQuery( iY_02+i,Y_02+i, Y_02 + What2,1 )) return 0;

        for (i=0;i<nY_01;i++)
        {
            if (iY_01[i] > Distance) { continue;}

            for (j=0;j<nY_02;j++)
            {
                if (UserAborted) return 1;
                if (iY_02[j] > Distance) { continue; }
                if ( iY_01[i] + iY_02[j] <= Distance &&
                     !TestBit(Dead_Products,i*nY_02+j) )
                {
                    FlagProduct(Dead_Products, i, j, 0);
                    SomeLeft--;
                }
            } /* Y_02 loop */
        } /* Y_01 loop */
    } /* while still stuff left */
    return 1;
}
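
The loop above relies on the stated assumption that the squared distance between two products is the sum of the squared distances between their corresponding reagents, so a product can be excluded from two per-reagent arrays without ever building a product fingerprint. A minimal sketch of that exclusion step follows; exclude_neighbors and its parameters are hypothetical, and a byte-per-product flag array stands in for the listing's bitset.

```c
#include <assert.h>

/* Mark every product (i, j) whose combined squared distance to the selected
   product is within radius*radius. d1sq[i] and d2sq[j] are the squared
   reagent distances; dead[] is a row-major n1*n2 array of flags.
   Returns the number of newly excluded products. Names are hypothetical. */
static int exclude_neighbors(const double *d1sq, int n1,
                             const double *d2sq, int n2,
                             double radius, unsigned char *dead)
{
    assert(d1sq && d2sq && dead && n1 >= 0 && n2 >= 0);
    double limit = radius * radius;
    int excluded = 0;
    for (int i = 0; i < n1; i++) {
        if (d1sq[i] > limit)
            continue;   /* squared distances are non-negative, so no j can qualify */
        for (int j = 0; j < n2; j++) {
            if (d1sq[i] + d2sq[j] <= limit && !dead[i * n2 + j]) {
                dead[i * n2 + j] = 1;
                excluded++;
            }
        }
    }
    return excluded;
}
```

The early continue on the first reagent is what makes the additive assumption pay off: whole rows of the product matrix are skipped after a single comparison.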



int TestBit(bitset, bit)
int *bitset, bit;
{
    int what, this;
    unsigned char *bytes;

    bytes = (unsigned char *) bitset;

    what = bit % 8;
    this = bit / 8;

    return (bytes[this] & setbits[what] );
}
int FlagProduct(TheProducts, index1, index2, this)
int *TheProducts;
int index1,index2, this;
{
    int what;
    unsigned char *Products;

    /* if (DebugLevel)
       printf("%d %d, %d, %x\n",index1,index2,this,TheProducts);*/
    Products = (unsigned char *) TheProducts;

    if (!this ) this = index1*nY_02 + index2; /* bit index */
    what = this % 8;
    this /= 8;

    Products[this] |= setbits[what];
    return 1;
}

int FlagReagentfTheReagent, size, index) 
int *TheReagcnt; 
int size, index; 

{ 

20 int what, this; 

unsigned char *Reagent; 

Reagent = (unsigned char *) TheReagent; 

what = index % 8; 
this = index / 8; 
25 Reagent[this] | = setbits[what]; 
return 1; 

} 
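The three routines above treat an int array as a packed string of bits, one bit per product or reagent. A minimal, self-contained sketch of the same idea follows. It assumes that the mask for bit k of a byte is 1 << k; the listing's own setbits[] table is filled in elsewhere and is not shown, so that ordering is an assumption. The names bit_test, bit_set and product_bit are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Assumption: bit k of a byte is the mask 1 << k. */
static const unsigned char sketch_setbits[8] = {1, 2, 4, 8, 16, 32, 64, 128};

/* Return nonzero when the given bit is set in the byte array. */
static int bit_test(const unsigned char *bytes, int bit)
{
    return bytes[bit / 8] & sketch_setbits[bit % 8];
}

/* Set the given bit in the byte array. */
static void bit_set(unsigned char *bytes, int bit)
{
    bytes[bit / 8] |= sketch_setbits[bit % 8];
}

/* Flat bit index of product (i, j) when there are n2 second reagents. */
static int product_bit(int i, int j, int n2)
{
    return i * n2 + j;
}
```

With n2 second reagents, product (i, j) lives at bit i*n2 + j, which is the same row-major layout FlagProduct computes when its last argument is zero.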
/* Here the intent is to select the next compound "intelligently".
   We try to maximize use of one or the other reagent.
*/

int SelectIt()
{
    int i, j;

    if (What1 < 0) { GrabRandom( &i, &j ); goto out; }

    switch (WhatFirst[0])
    {
    case '0':
        GrabRandom( &i, &j );
        break;

    case '1':
        GrabThis( &i, &j, 1 );
        break;

    case '2':
        GrabThis( &i, &j, 2 );
        break;
    }

out:
    OutputThisHit( i, j ); /* both print it and note it in bitsets */
    return 1;
}

int GrabThis( p1, p2, type)
int *p1, *p2, type;
{
    unsigned char *p, *q, *pro;
    int index;

    switch (type)
    {
    case 1:     /* prefer the current row (first reagent), then the column */
        if (!findOne(Dead_Products, What1*nY_02, 1, nY_02) &&
            !findOne(Dead_Products, What2, nY_02, nY_01) &&
            !GrabRandom( p1, p2 ) ) return 0;
        break;

    case 2:     /* prefer the current column (second reagent), then the row */
        if (!findOne(Dead_Products, What2, nY_02, nY_01) &&
            !findOne(Dead_Products, What1*nY_02, 1, nY_02) &&
            !GrabRandom( p1, p2 ) ) return 0;
        break;
    }

    *p1 = What1;
    *p2 = What2;

    return 1;
}



#if 0

/* This can be done more efficiently when we KNOW we are walking a vector */
int findOne(bitset,start,incr,max)
int *bitset, start, incr, max;
{
    int i;

    for (i = 0; i < max; i++, start += incr)
    {
        if ( TestBit(bitset, start)) continue;
        What1 = start / nY_02;
        What2 = start % nY_02;
        return 1;
    }

    return 0;
}

#else

int findOne(bitset,start,incr,max)
int *bitset, start, incr, max;
{
    static int oldstart = -1234,
               oldincr,
               old_i;
    int i;

    /* resume from the previous position when walking the same vector */
    if ( (start != oldstart) || (incr != oldincr) ) old_i = -1;
    oldstart = start; oldincr = incr;
    old_i++;

    start += incr * old_i;
    for (i = old_i; i < max; i++, start += incr)
    {
        if ( TestBit(bitset, start)) continue;
        What1 = start / nY_02;
        What2 = start % nY_02;
        old_i = i;
        return 1;
    }

    oldstart = -1234;
    return 0;
}

#endif

int GrabRandom( p1, p2, fp)
int *p1, *p2, *fp;
{
    int index, sum;
    int value1, value2;
    unsigned char *p, *q, *pro;

    p = (unsigned char *) Dead_Products;

    index = UTL_MATH_RAND() * SomeLeft + 1;

    value1 = sum = 0;

    /* skip whole bytes, counting the products that are still alive */
    while (sum < index)
    {   sum += nbits[ ~(*p++) & 255 ];
        value1 += 8; }

    /* back up one byte and walk its bits to find the chosen product */
    p -= 1; sum -= nbits[ ~(*p) & 255 ]; value1 -= 9; value2 = (~(*p) & 255);
    while (sum < index)
    {   value1++;
        if ( value2 & 1 ) sum++;
        value2 = value2 >> 1; }

    value2 = value1 % nY_02;
    value1 /= nY_02;

    *p1 = What1 = value1;
    *p2 = What2 = value2;

    return 1;
}
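GrabRandom appears to pick the index-th product that is still alive, that is, the index-th zero bit in the dead-product bitset. A minimal sketch of that lookup is given below. It assumes bits are numbered from the least significant bit of each byte upward, which matches the shift-right walk in the listing but is otherwise an assumption. The names byte_popcount and kth_clear_bit are hypothetical, and a loop stands in for the listing's nbits[] table.

```c
#include <assert.h>

/* Count set bits in the low 8 bits of v. */
static int byte_popcount(unsigned v)
{
    int n = 0;
    for (v &= 255u; v; v >>= 1) n += (int)(v & 1u);
    return n;
}

/* Return the index of the k-th (1-based) zero bit, scanning bytes in order
 * and bits from least significant upward, or -1 if fewer than k zero bits
 * exist in the first nbytes bytes. */
static int kth_clear_bit(const unsigned char *bits, int nbytes, int k)
{
    int byte, bit, seen = 0;

    for (byte = 0; byte < nbytes; byte++) {
        int live = byte_popcount(~(unsigned)bits[byte]);
        if (seen + live < k) {          /* target is not in this byte */
            seen += live;
            continue;
        }
        for (bit = 0; bit < 8; bit++) {
            if (!(bits[byte] & (1u << bit)) && ++seen == k)
                return byte * 8 + bit;
        }
    }
    return -1;
}
```

Dividing the returned index by the number of second reagents gives the first reagent index, and the remainder gives the second, as in the last lines of GrabRandom.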

int OutputThisHit( index1, index2)
int index1, index2;
{
    int which;

    which = index1*nY_02+index2;

    fprintf(OutputFile,"%s%d %d %d\n", PrefixForFiles, which+1,
            index1+1, index2+1);
    FlagProduct(Good_Products, 0, 0, which);
    FlagProduct(Dead_Products, 0, 0, which); /* can only be selected once */
    /* note use of reagents; this is slightly wasteful of time */
    FlagReagent(Good_1, nY_01, index1);
    FlagReagent(Good_2, nY_02, index2);

    if (DebugLevel) printf("Selection %d is %d , %d\n",
                           nProcessed+1, index1+1, index2+1);

    return 1;
}
Appendix "K"

/* dbcsln_design */

/*+C
 *
 * This program evaluates (proximate) Topomer+Tanimoto similarity vs cSLNs
 * based on preprocessing of the substituent reagents. Using this, it
 * selects a diverse set of products while trying to maximize use of
 * some groups. Diversity is achieved by zapping all neighbors after each
 * new selection, so that any non-zapped product can freely be selected.
 *
 * To be added: restart capability and reagent blackout.
 * (i.e. to recomplete an earlier design and/or to remove
 * all occurrences of Y_01 = 37 and so on when they
 * prove to be unavailable or otherwise unsuitable).
 *
 * Limitations: currently exactly 2 R groups are assumed. Need to extend
 * to more than 2 and to handle X groups.
 *
 * The resultant file contains one line per hit, of the form
 *
 *     Y1 Y2
 *
 * where Y1 = index of the substituent in X1.pro file
 *       Y2 = index of the substituent in X2.pro file
 *
 * Options: Look at the array Options below.
 *
 */

#include <stdio.h>
#include <signal.h>
#include <ctype.h>
#include <unistd.h>
#include <string.h>
#include <sys/stat.h>
#include <math.h>
#include "parseopt.h"
#include "utl_str.h"
#include "utl_mem.h"
#include "utl_file.h"
#include "utl_math.h"
#include "ct.h"
#include "ct_expr.h"
#include "ct_proto.h"
#include "import_proto.h"
#include "io_fprint.h"
#include "commonData.h" /* Globals used by most functions, we will clean this
                           up soon */
#include "dbcsln_bs_proto.h"
#include "dbcsln_hlm_proto.h"
#define OBSOLETE_IS_OK 1
FILE *debugFile = (FILE *) NULL;

#ifdef OBSOLETE_IS_OK
/* these sections retain the filtering capabilities now also present
   in db_filter.c - at some point they should exist ONLY in db_filter.
*/
static struct RangeInfoStruct RangeValuesData;
static struct OneOfInfoStruct OneOfValuesData;
static struct InputInfoStruct InputData;
static int NumRangeFields;
static int NumRangeFieldsAllocated;
static RangeStruct *RangeFields;
static int NumOneOfFieldsAllocated;
static int NumOneOfFields;
static OneOfStruct *OneOfValues;
static float **RangeValues_Y01; /* Actual values read in from nnn.X1 file.
                                   If MW is the first and logp is the second value
                                   specified on the -rangevar argument list then
                                   RangeValues_Y01[n][0] would keep the value for MW
                                   for the nth line in the nnn.X1 file and
                                   RangeValues_Y01[n][1] would keep the value for
                                   logp for that line */
static float **RangeValues_Y02; /* same */
static int **OneOfValues_Y01;   /* Actual values read from nnn.X1 files but translated
                                   into an index of OneOfValues[i].values so
                                   we don't have to waste memory and time doing strcmp */
static int **OneOfValues_Y02;   /* Same */
#endif

static char *MasterFile;
static char *MasterFileList;
static char *BitsetFileList;
static char *MasterRecord;
static FILE *MasterFile_File;
static char *FngrFile;
static int FingerCore_Card;
static int *FingerCore_FP;
static char *RangeVar;
static char *OneOfVar;
static double Tanimoto = 0.85;
static int WordsPerFingerprint = 0;
static int BytesPerFingerPrint = 0;
static int NoMorehitsPlease = 999999999;
static int DebugLevel;
static int UserAborted;
static char *OutputFileName;
static char *CheckPointFileName;
static char *WhatFirst;
static char *InputSource = 0;
static char *BitsetSource = 0;
static char *DatabaseNames = (char *)0;
static char *HitlistNames = (char *)0;
static int BitOffsets[MAX_INPUT_CSLNS]; /* why recompute? */
int TotalProducts;
static int Pro_size;
static struct ParseOptions Options[] = {
/*** DO NOT MOVE ENTRIES IN THIS TABLE. ADD ENTRIES ONLY AT THE END. ***/
    {"master", ParseOptString, &MasterFile,
     "Name is the file with master file records" },
    {"masterlist", ParseOptString, &MasterFileList,
     "Name of the file containing master input/result file names" },
    {"bitsetlist", ParseOptString, &BitsetFileList,
     "Name of the file containing bitset input/result file names" },
    {"index", ParseOptString, &MasterRecord,
     "Which MasterRecord or Bitset entry 1-n" },
    {"Tanimoto", ParseOptDouble, &Tanimoto,
     "Similarity threshold (0.0 to 1.0)" },
    {"distance", ParseOptDouble, &Distance,
     "Topomer distance (typically 75 to 100)" },
    {"maxhits", ParseOptInt, &NoMorehitsPlease,
     "Maximum number of hits before stopping" },
    {"bitset", ParseOptString, &BitsetSource,
     "Bitset file to start from" },
    {"output", ParseOptString, &OutputFileName,
     "File to which hit info will be written." },
    {"checkpoint", ParseOptString, &CheckPointFileName,
     "File to which bitset info will be written." },
    {"prefer", ParseOptString, &WhatFirst,
     "One of R1, R2 to maximize use of." },
    {"debug", ParseOptBoolean, &DebugLevel,
     "Use +debug to enable debugging messages" },
#ifdef OBSOLETE_IS_OK
    {"rangevar", ParseOptString, &RangeVar,
     "Scalar field name and range to filter out, i.e. logp -1.0 8.0 MW 200 500 price 0 12.50" },
    {"oneof", ParseOptString, &OneOfVar,
     "Field name and list of values that the product should match\n, i.e. supplier Aldrich,Sigma,Fluka,SALOR taste SWEET,Salty" },
#endif
    {"database", ParseOptString, &DatabaseNames,
     "Unity database to use to exclude possible products" },
    {"hitlist", ParseOptString, &HitlistNames,
     "Unity hitlist to use to exclude possible products" },
};



static int WarmUp()
{
    int i;

    /* BigBits[i] = number of bits set in the 16-bit value i */
    for (i = 0; i < 65536; i++) BigBits[i] = (i&1) + (i&2)/2 + (i&4)/4 + (i&8)/8 +
                                (i&16)/16 + (i&32)/32 + (i&64)/64 + (i&128)/128
                                + (i&256)/256 + (i&512)/512 + (i&1024)/1024
                                + (i&2048)/2048
                                + (i&4096)/4096 + (i&8192)/8192 + (i&16384)/16384
                                + (i&32768)/32768;
    setbits_nbits_Init();
    return 1;
}
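WarmUp fills a 65536-entry table so that counting the bits of a 16-bit word later costs a single lookup. A minimal sketch of an equivalent table, built with the recurrence count(i) = count(i / 2) + (i & 1) instead of the sixteen-term sum, is shown below. The names bit_count16, bit_count16_init and popcount32 are hypothetical and do not appear in the listing.

```c
#include <assert.h>

/* Assumption: a 65536-entry table indexed by a 16-bit value, as BigBits
 * appears to be in the listing. */
static unsigned char bit_count16[65536];

/* Fill the table with the recurrence count(i) = count(i / 2) + (i & 1). */
static void bit_count16_init(void)
{
    unsigned i;
    bit_count16[0] = 0;
    for (i = 1; i < 65536u; i++)
        bit_count16[i] = (unsigned char)(bit_count16[i >> 1] + (i & 1u));
}

/* Count the set bits of a 32-bit value with two table lookups. */
static int popcount32(unsigned long v)
{
    return bit_count16[v & 0xFFFFul] + bit_count16[(v >> 16) & 0xFFFFul];
}
```

The table must be initialized once before any lookup, which is the role WarmUp plays at the start of ReadEverything.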

static int WhatsTheDifference()
{
    int i, j;
#define pow2(a) ((a)*(a))

    /* the assignment of codes is based on the following (from gai_pls.c):
       static float cutoff[16] = {9999., 0., 2., 4., 6., 8., 10., 12.,
                                  14., 16., 18., 20., 22., 24., 26., 30. }; */

    boundary[0] = 9999.; /* missing data ought never to occur. */
    boundary[1] = -0.1;
    for (i = 2; i < 15; i++)
        boundary[i] = 2*i-3;
    boundary[15] = 30.0; /* this is a steep curve with a cutoff at 30! */
    for (i = 0; i < 16; i++) for (j = 0; j < 16; j++)
        Dist[i][j] = pow2( boundary[i] - boundary[j] );
    Distance *= Distance; /* want to test D^2 directly */
    return 1;
}
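WhatsTheDifference precomputes the squared difference between every pair of 4-bit bin codes, and IntersectQuery then sums table entries over both codes packed in each byte. A minimal sketch combining the two steps follows, using the boundary values that appear in the listing. The names sketch_boundary, sketch_dist, sketch_dist_init and packed_distance2 are hypothetical.

```c
#include <assert.h>

static double sketch_boundary[16];
static double sketch_dist[16][16];

/* Build the 16x16 table of squared differences between bin boundaries. */
static void sketch_dist_init(void)
{
    int i, j;
    sketch_boundary[0] = 9999.0;        /* code 0: missing data */
    sketch_boundary[1] = -0.1;
    for (i = 2; i < 15; i++)
        sketch_boundary[i] = 2.0 * i - 3.0;
    sketch_boundary[15] = 30.0;
    for (i = 0; i < 16; i++)
        for (j = 0; j < 16; j++) {
            double d = sketch_boundary[i] - sketch_boundary[j];
            sketch_dist[i][j] = d * d;
        }
}

/* Sum the squared distance over both 4-bit codes packed in each byte. */
static double packed_distance2(const unsigned char *a,
                               const unsigned char *b, int nbytes)
{
    double sum = 0.0;
    int i;
    for (i = 0; i < nbytes; i++)
        sum += sketch_dist[a[i] & 0x0F][b[i] & 0x0F]
             + sketch_dist[(a[i] & 0xF0) >> 4][(b[i] & 0xF0) >> 4];
    return sum;
}
```

Since the result is a squared distance, a user threshold such as 90 has to be squared before comparison, which is what the listing does with Distance *= Distance.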

static int CalcualteProductFingurePrint(product,firstPart,secondPart)
int *product;
int *firstPart;
int *secondPart;
{
    int index;
    int totalBitsSet = 0;
    unsigned char *prod, *y01, *y02;

    prod = (unsigned char *) product;
    y01 = (unsigned char *) firstPart;
    y02 = (unsigned char *) secondPart;
    /* product fingerprint is the bitwise OR of the two reagent fingerprints */
    for (index = 0; index < BytesPerFingerPrint; index++, prod++)
    {
        *prod = *y01++ | *y02++;
        totalBitsSet += nbits[*prod & 255];
    }

    return totalBitsSet;
}

static int IntersectQuery( pIntr, pFP, pXntr, pXP, xuery, index)
int *pIntr, **pFP;
double *pXntr;
unsigned char **pXP, **xuery;
int index;
{
    unsigned char *ptr, *qtr;
    int i, count;
    double xount;

    if (!(*pFP) || !(*pXP))
        return 1;
    /* bits shared between this reagent fingerprint and the query */
    ptr = (unsigned char *) *pFP;
    qtr = (unsigned char *) query;
    for (count = 0, i = 0; i < WordsPerFingerprint*4; i++)
        count += nbits[ *ptr++ & *qtr++ ];
    *pIntr = count;
    if (xuery)
    {
        /* squared topomer distance, two 4-bit codes per byte */
        ptr = (unsigned char *) *pXP;
        qtr = (unsigned char *) *xuery;
        for (xount = 0.0, i = 0; i < XytesPerFingerPrint[index]; i++, ptr++, qtr++)
            xount += Dist[ *ptr & 0x0F ][ *qtr & 0x0F ]
                   + Dist[ (*ptr & 0xF0) >> 4 ][ (*qtr & 0xF0) >> 4 ];
        *pXntr = xount;
    }

    return 1;
}

static int ActuallyCompute( index1, index2, pUnion, pIntersection, pMaxTan, currentInput)
int index1, index2, *pUnion, *pIntersection;
double *pMaxTan;
{
    int i;
    unsigned short *h1, *h2, *hquery, product;
    int numberOfMissingBits;

    if ( currentInput == -1 )
        numberOfMissingBits = NumMissingBits[0];
    else
        numberOfMissingBits = NumMissingBits[currentInput];

    h1 = (unsigned short *) Y_01[index1];
    h2 = (unsigned short *) Y_02[index2];
    hquery = (unsigned short *) query;
    *pUnion = *pIntersection = 0;

    for (i = 0; i < WordsPerFingerprint*2; i++, h1++, h2++, hquery++)
    {
        /* product = (*h1 | *h2); */
        *pUnion += BigBits[ (*h1 | *h2) | *hquery ];
        *pIntersection += BigBits[ (*h1 | *h2) & *hquery ];
    }

    *pMaxTan = (double) (*pIntersection + numberOfMissingBits) / (double) *pUnion;
    return 1;
}
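ActuallyCompute evaluates a Tanimoto coefficient, the number of shared bits divided by the number of bits set in either fingerprint, with a credit for "missing" bits added to the numerator. A minimal sketch for two plain fingerprints is shown below. The listing first ORs two reagent fingerprints into a product fingerprint; that step is left out here. The names count16 and tanimoto16 are hypothetical, and returning 0.0 when both fingerprints are empty is an assumption, since the listing does not guard that case.

```c
#include <assert.h>

/* Count the set bits of a 16-bit value (loop version, no table). */
static int count16(unsigned v)
{
    int n = 0;
    for (v &= 0xFFFFu; v; v >>= 1)
        n += (int)(v & 1u);
    return n;
}

/* Tanimoto coefficient of two fingerprints stored as 16-bit words.
 * missing_bits is added to the intersection count, as in the listing. */
static double tanimoto16(const unsigned short *a, const unsigned short *b,
                         int nwords, int missing_bits)
{
    int i, uni = 0, inter = 0;

    for (i = 0; i < nwords; i++) {
        uni   += count16((unsigned)(a[i] | b[i]));
        inter += count16((unsigned)(a[i] & b[i]));
    }
    if (uni == 0)                       /* assumption: empty vs empty is 0 */
        return 0.0;
    return (double)(inter + missing_bits) / (double)uni;
}
```

For example, fingerprints 0x000F and 0x003C share two bits out of six set in total, giving a coefficient of one third.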

static int
ZapMNeighbors(thisQuery,thisC_Query,numZapped,doCTOPS,index1,index2,currentInput)
int *thisQuery;
int thisC_Query;
int *numZapped;
int doCTOPS;
int currentInput;
{
    int cqt, q_lo, q_hi, i, j, carhold, inthold, onion, intsc;
    double max;
    int k;
    int Y_01_Offset, Y_02_Offset;
    int pos;
    int numberOfMissingBits;

    if ( currentInput == -1 )
        numberOfMissingBits = NumMissingBits[0];
    else
        numberOfMissingBits = NumMissingBits[currentInput];

    if ( thisQuery )
    {
        memcpy(query, thisQuery, BytesPerFingerPrint);
        c_query = thisC_Query;
    }

    *numZapped = 0;
    Y_01_Offset = Y_02_Offset = 0;
    for ( k = 0; k < currentInput; k++ )
    {
        Y_01_Offset += Y_01_Length[k];
        Y_02_Offset += Y_02_Length[k];
    }

    for (i = 0; i < nY_01; i++)
        if (!IntersectQuery( iY_01+i,
                             Y_01+i,
                             iX_01+i,
                             X_01+i,
                             (doCTOPS) ? X_01 + index1 + Y_01_Offset
                                       : NULL,
                             0))
            return 0;
    for (i = 0; i < nY_02; i++)
        if (!IntersectQuery( iY_02+i,
                             Y_02+i,
                             iX_02+i,
                             X_02+i,
                             (doCTOPS) ? X_02 + index2 + Y_02_Offset
                                       : NULL,
                             1))
            return 0;
    /* now zap topomer neighbors */
    /*
    ** Only do topomer neighbors if CTOPS was present in the input.
    */
    if ( doCTOPS )
    {
        Y_01_Offset = Y_02_Offset = 0;
        for ( k = 0; k < TotalInputs; k++ )
        {
            for (i = 0; i < Y_01_Length[k]; i++)
            {
                if (iX_01[i + Y_01_Offset] > Distance)
                    continue;

                for (j = 0; j < Y_02_Length[k]; j++)
                {
                    if (UserAborted)
                        return 1;

                    if (iX_02[j+Y_02_Offset] > Distance)
                        continue;
                    if (iX_01[i+Y_01_Offset] + iX_02[j+Y_02_Offset] <= Distance &&
                        !TestBit(Dead_Products,
                                 BitMapStartPoint[k] + i*Y_02_Length[k] + j) )
                    {
                        if (DebugLevel == 69)
                            printf("Distance kill %d %d - %f , %f + %f\n",
                                   i+1, j+1, iX_01[i] + iX_02[j], iX_01[i], iX_02[j]);
                        pos = BitMapStartPoint[k] + i*Y_02_Length[k] + j;
                        FlagProduct(Dead_Products, 0, 0, pos);
                        SomeLeft--;
                        RemainingInput[k]--;
                        (*numZapped)++;
                    }
                } /* Y_02 loop */
            } /* Y_01 loop */
            Y_01_Offset += Y_01_Length[k];
            Y_02_Offset += Y_02_Length[k];
        }
    }

    cqt = floor( (double) c_query / Tanimoto);
    q_lo = floor( (double) c_query * Tanimoto - (double) numberOfMissingBits );
    q_hi = ceil( (double) ( c_query + numberOfMissingBits ) / Tanimoto);
    inTestBit = inActually = 0;
    Y_01_Offset = Y_02_Offset = 0;
    /*
    ** Run thru all the input files, one at a time.
    */
    for ( k = 0; k < TotalInputs; k++ )
    {
        for (i = 0; i < Y_01_Length[k]; i++)
        {
            if (cY_01[i+Y_01_Offset] > cqt) { continue; }
            carhold = q_lo - cY_01[i+Y_01_Offset];
            inthold = q_lo - iY_01[i+Y_01_Offset];
            for (j = 0; j < Y_02_Length[k]; j++)
            {
                if (UserAborted) return 1;

                if (cY_02[j+Y_02_Offset] > cqt) { continue; }
                if (cY_02[j+Y_02_Offset] < carhold) { continue; }
                if (inthold > iY_02[j+Y_02_Offset]) { continue; }
#ifdef Waste_time
                time( &waste_time );
#endif
                if (TestBit(Dead_Products, BitMapStartPoint[k] + i*Y_02_Length[k] + j))
                    continue;
#ifdef Waste_time
                inTestBit += time( &trash_time ) - waste_time;
#endif
                ActuallyCompute( Y_01_Offset + i, Y_02_Offset + j, &onion, &intsc, &max, k);
#ifdef Waste_time
                inActually += time( &waste_time ) - trash_time;
#endif
                if (max >= Tanimoto)
                {
                    if (DebugLevel == 69)
                        printf("Tanimoto kill %d %d - %6.3f , %d + %d, %d + %d\n",
                               i+1, j+1, max, cY_01[i], cY_02[j], iY_01[i], iY_02[j]);
                    pos = BitMapStartPoint[k] + i*Y_02_Length[k] + j;
                    FlagProduct(Dead_Products, 0, 0, pos);
                    SomeLeft--;
                    RemainingInput[k]--;
                    (*numZapped)++;
                    if ( DebugLevel )
                        fprintf(stderr,"\nZapping %d %d %d", k+1, i+1, j+1);
                }
            } /* Y_02 loop */
        } /* Y_01 loop */
        Y_01_Offset += Y_01_Length[k];
        Y_02_Offset += Y_02_Length[k];
    } /* Number of Inputfile loop */
#ifdef Waste_time
    Visual((stderr,"ActuallyCompute : %d",inActually));
    Visual((stderr," TestBit : %d",inTestBit ));
#endif



    if ( DebugLevel )
    {
        int fred;

        for ( fred = 0; fred < TotalInputs; fred++ )

DumpValues(fred,Y_01_Length[fredl,Y_02_Length[fred],Actually^ 

} 
} 
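Before calling ActuallyCompute, the loops above appear to skip candidates whose bit counts alone rule out reaching the Tanimoto threshold. The underlying bound is that the coefficient of two fingerprints can never exceed the smaller bit count divided by the larger one. A minimal sketch of that screen for two complete fingerprints follows. The listing applies related tests per reagent and adjusts for missing bits; this sketch does neither, and the name may_reach_threshold is hypothetical.

```c
#include <assert.h>

/* Necessary (not sufficient) test: can two fingerprints with the given
 * bit counts possibly have a Tanimoto coefficient >= threshold?
 * Uses T <= min(card_a, card_b) / max(card_a, card_b). */
static int may_reach_threshold(int card_a, int card_b, double threshold)
{
    int lo = card_a < card_b ? card_a : card_b;
    int hi = card_a < card_b ? card_b : card_a;

    if (hi == 0)
        return 0;                       /* nothing set in either one */
    return (double)lo >= threshold * (double)hi;
}
```

A pair that fails this test can be skipped without touching the fingerprints themselves; a pair that passes still needs the full computation.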



/*
** Abstract         : Function zaps products that are missing CTOPS or FP fields.
**
** Usage            :
**
** Returns          : 1 if the data value is missing or zero if the values exist.
**
** Algorithms       : None.
**
** Revision History :
**
** Author              Date        Description
**
** Fred Soltanshahi    05/21/96    Original version.
*/



static int IsItMissingAValue(index1,index2,currentInput)
int index1;
int index2;
{
    int Y_01_Offset = 0;
    int Y_02_Offset = 0;
    int k;

    for ( k = 0; k < currentInput; k++ )
    {
        Y_01_Offset += Y_01_Length[k];
        Y_02_Offset += Y_02_Length[k];
    }

    if ( ( Y_01[index1+Y_01_Offset] == NULL ) ||
         ( Y_02[index2+Y_02_Offset] == NULL ) ||
         ( X_01[index1+Y_01_Offset] == NULL ) ||
         ( X_02[index2+Y_02_Offset] == NULL ) )
        return 1;

    return 0;
}

static int GetNextLine( FILE *filePointer, FILE *fingerfp, int *pCard, int **pFP,
                        unsigned char **pXP, int index
#ifdef OBSOLETE_IS_OK
                        , float **rangeValues, int **oneOfValues
#endif
                        )
{
    char *line, *pcard, *fp, *CTOPS;
    int words, hold;
    int pos;

    if (-1 == UTL_SCAN_GETS( filePointer, "\\", &line))
        goto AddTraceback;
#ifdef OBSOLETE_IS_OK
    ReadLineAttributes(line,
                       NumRangeFields,
                       rangeValues,
                       RangeFields,
                       NumOneOfFields,
                       oneOfValues,
                       OneOfValues);
#endif
    /* CTOPS = strstr(line,"CTOPS=")+strlen("CTOPS="); */
    CTOPS = strstr(line,"CTOPS=");

    if (!(*pFP = (int *) UTL_MEM_ALLOC( BytesPerFingerPrint)))
        goto AddTraceback;
    if (!UTL_FILE_FREAD( pCard, sizeof(int), 1, fingerfp))
        goto AddTraceback;
    if (!UTL_FILE_FREAD( *pFP, sizeof(int), WordsPerFingerprint, fingerfp))
        goto AddTraceback;
    if (CTOPS)
    {
        CTOPS += strlen("CTOPS");
        UTL_SCAN_TOKENIZE(CTOPS, ';', '\\');
        UTL_SCAN_TOKENIZE(CTOPS, '>', '\\');
        words = strlen(CTOPS) / 2; /* must have 8 bit bytes */
        if (!XytesPerFingerPrint[index])
        {   XytesPerFingerPrint[index] = words;
        }
        if ( words != XytesPerFingerPrint[index]) goto MissingValue;
        *pXP = (unsigned char *) UTL_MEM_ALLOC(words);
        /* decode the hex string two characters (one byte) at a time */
        for (words = 0; words < XytesPerFingerPrint[index]; words++)
        {
            memcpy(next2, CTOPS, 2);
            CTOPS += 2;
            sscanf(next2, "%2x", &hold);
            *(*pXP+words) = (unsigned char) hold;
        }
    }

    return 1;
MissingValue :
    *pCard = 0;
    *pFP = (int *)NULL;
    *pXP = (unsigned char *)NULL;
    return 1;
AddTraceback :
    return 0;
}
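GetNextLine decodes the CTOPS field by reading two hexadecimal characters per byte. A minimal, self-contained sketch of that decoding with explicit error checks is given below. The function name hex_to_bytes and its return convention are hypothetical, and the stricter validation is an addition that the listing does not perform.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Decode a string of hex digit pairs into bytes; returns the number of
 * bytes written, or -1 on odd length, overflow, or a non-hex character. */
static int hex_to_bytes(const char *hex, unsigned char *out, int max_out)
{
    size_t len = strlen(hex);
    int n = (int)(len / 2), i;
    char pair[3];
    unsigned hold;

    if (len % 2 != 0 || n > max_out)
        return -1;
    pair[2] = '\0';                     /* sscanf needs a terminator */
    for (i = 0; i < n; i++) {
        memcpy(pair, hex + 2 * i, 2);
        if (!isxdigit((unsigned char)pair[0]) ||
            !isxdigit((unsigned char)pair[1]))
            return -1;
        if (sscanf(pair, "%2x", &hold) != 1)
            return -1;
        out[i] = (unsigned char)hold;
    }
    return n;
}
```

Each decoded byte then carries two 4-bit bin codes, which is why the listing notes that bytes must be 8 bits wide.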

static int not_here( what, nbytes )
unsigned char *what;
int nbytes;
{
    /* complement every byte in place */
    for ( ; nbytes; --nbytes) { *what = ~*what; what++; }
    return 1;
}

/* this belongs in the utl module, actually */
int MakeComLine( char *line, int len, int argc, char **argv)
{
    int i;

    sprintf(line, "%s ", argv[0]);
    for (i = 1; i < argc; i++)
    {
        line += strlen(line);
        sprintf(line, "%s ", argv[i]);
    }
    return 1;
}

static void UserHitControlC()
/*+!
 * This function is the signal handler for user initiated program termination.
 * Its only role is to set a flag indicating that the user wishes to abort the program.
 *
 * Author        Date      Description
 * G. B. Smith   02-09-93  Original Version
 */
{
    UserAborted = 1;
}

static int ReadEverything()
{
    char *hold;
    char buff[255];
    int i;
    int j;
    char *cp;
    char *mp;
    int offset;
    int size;
    char *input;
    FILE *fp;
    char *line;
    void *bitset[MAX_INPUT_CSLNS];

    /* because failure here means end program run, no effort to clean up
       memory on error is included. */
    offset = 0;

    if (!WarmUp() || !WhatsTheDifference()) return 0;

    if ( MasterFileList || BitsetFileList )
    {
        if ( !( fp = fopen(MasterFileList ? MasterFileList : BitsetFileList, "r")) )
            return 0;
        TotalInputs = 0;

whUe ( UTL_SCAN_GETS( fp, *\\-, "r, &Iine) != -1 ) 

        {
            /* each line: input name, start record, checkpoint name */
            strcpy(buff, line);
            cp = strtok(buff, " ");
            InputNames[TotalInputs] = UTL_STR_SAVE(cp);
            cp = strtok(NULL, " ");
            InputStartRec[TotalInputs] = atoi(cp);
            cp = strtok(NULL, " ");
            OutputCheckpointNames[TotalInputs++] = UTL_STR_SAVE(cp);
        }
    }
    else
{ 
        if ( !MasterFile && !BitsetSource && !MasterFileList )
        {
            fprintf(stderr,"An input file (master or bitset) must be specified\n");
            return 0;
        }
        if ( MasterFile && BitsetSource )
        {
            fprintf(stderr,"A design run can be run from either a master or a bitset file\n");
            return 0;
        }
        if ( MasterFile && !MasterRecord )
        {
            fprintf(stderr,"A Bitset (or Master) record number must be specified\n");
            return 0;
        }
        /*
        ** Special case where we want to process all the records in the
        ** master file.
        */
        if ( atoi(MasterRecord) == -1 )
        {
            if ( ( TotalInputs = CountMasterRecords(MasterFile)) == 0 )
                goto UnableToReadMaster;
            for ( i = 0; i < TotalInputs; i++ )
            {
                InputNames[i] = UTL_STR_SAVE(MasterFile);
                InputStartRec[i] = i+1;
            }
        }
        else
        {
            if ( MasterFile )
                input = UTL_STR_SAVE(MasterFile);
            else
                input = UTL_STR_SAVE(BitsetSource);
            /*
            ** If there is more than one input file, process them all.
            */
            cp = strtok(input, " ");
            while (cp)
            {
                InputNames[TotalInputs++] = UTL_STR_SAVE(cp);
                cp = strtok(NULL, " ");
            }
            mp = strtok(MasterRecord, " ");
            for ( i = 0; i < TotalInputs; i++ )
            {
                /*
                ** If the user specified record numbers for all the master files, then use them
                ** otherwise we will use the first record.
                */
                if (mp)
                {
                    InputStartRec[i] = atoi(mp);
                    mp = strtok(NULL, " ");
                }
                else
                    InputStartRec[i] = 1;
            }
        }

        mp = strtok(CheckPointFileName, " ");
        for ( i = 0; i < TotalInputs; i++ )
        {
            if (mp)
            {
                OutputCheckpointNames[i] = UTL_STR_SAVE(mp);
                mp = strtok(NULL, " ");
            }
            else
            {

sprintf(buff / %s- «d_chk.bs%baseiiaine(InpuW 
                OutputCheckpointNames[i] = UTL_STR_SAVE(buff);

} 

} 
} 

    nY_01 = nY_02 = 0;
    for ( i = 0; i < TotalInputs; i++ )
    {
        if ( MasterFile || MasterFileList )
        {
            if ( !RetrieveMasterFile(InputNames[i],
                                     MasterFile_File,
                                     InputStartRec[i],
                                     &(NumMissingBits[i]),
                                     &(BitsInAbsentiaNoCount[i]),
                                     &(CoreFileNames[i]),
                                     &(CoreStart[i]),
                                     &FngrFile,
                                     &(X1file[i]),
                                     &(X2file[i]),
                                     &(Y_01_Length[i]),
                                     &(Y_02_Length[i]),
                                     &(fingerFP[i]),
                                     &(fingerOffsets[i]),
                                     &ScreenFileName,
                                     &BytesPerFingerPrint,
                                     &WordsPerFingerprint,
                                     &query,
                                     &FingerCore_FP,
                                     &FingerCore_Card) )
                goto UnableToReadMaster;
            RemainingInput[i] = Y_01_Length[i] * Y_02_Length[i];
        }
        else
        {
            if (!(bitset[i] = CS_PRDCT_BITSET_OPEN(InputNames[i],
                                                   InputStartRec[i])) )
                goto UnableToReadBitset;
            if ( !RetrieveMasterFileFromBitset(bitset[i],
                                               &(MasterFile_Bitset[i]),
                                               &(StartRec_Bitset[i]),
                                               &(NumMissingBits[i]),
                                               &(BitsInAbsentiaNoCount[i]),
                                               &(CoreFileNames[i]),
                                               &(CoreStart[i]),
                                               &FngrFile,
                                               &(X1file[i]),
                                               &(X2file[i]),
                                               &(Y_01_Length[i]),
                                               &(Y_02_Length[i]),
                                               &(fingerFP[i]),
                                               &(fingerOffsets[i]),
                                               &ScreenFileName,
                                               &BytesPerFingerPrint,
                                               &WordsPerFingerprint,
                                               &query,
                                               &FingerCore_FP,
                                               &FingerCore_Card) )
                goto UnableToReadBitset;
            RemainingInput[i] = CS_PRDCT_BITSET_SELECTED(bitset[i]);
        }
        nY_01 += Y_01_Length[i];
        nY_02 += Y_02_Length[i];
    }

    if (!(Y_01 = (int **) UTL_MEM_ALLOC(sizeof(int *) * nY_01)))
        goto UnableToAllocateMemory;
    if (!(cY_01 = (int *) UTL_MEM_ALLOC(sizeof(int) * nY_01)))
        goto UnableToAllocateMemory;
    if (!(iY_01 = (int *) UTL_MEM_ALLOC(sizeof(int) * nY_01)))
        goto UnableToAllocateMemory;
    if (!(X_01 = (unsigned char **)
                 UTL_MEM_ALLOC(sizeof(unsigned char *) * nY_01)))
        goto UnableToAllocateMemory;
    if (!(iX_01 = (double *) UTL_MEM_ALLOC(sizeof(double) * nY_01)))
        goto UnableToAllocateMemory;
#ifdef OBSOLETE_IS_OK
    if ( NumRangeFields )
    {
        if (!(RangeValues_Y01 = (float **) UTL_MEM_ALLOC(sizeof(float *) * nY_01)))
            goto UnableToAllocateMemory;
    }
    if ( NumOneOfFields )
    {
        if (!(OneOfValues_Y01 = (int **) UTL_MEM_ALLOC(sizeof(int *) * nY_01)))
            goto UnableToAllocateMemory;
    }
#endif
    /*
    ** Read all the values for the X1 file.
    */
    for ( j = 0; j < TotalInputs; j++ )
    {
        if ( !(fileHandles[j] = fopen(X1file[j],"r")) )
            goto UnableToOpenX1File;
        for (i = 0; i < Y_01_Length[j]; i++)
        {
            if (!GetNextLine( fileHandles[j],
                              fingerFP[j],
                              cY_01+i + offset,
                              Y_01+i + offset,
                              X_01+i + offset,
                              0
#ifdef OBSOLETE_IS_OK
                              , RangeValues_Y01 + i + offset,
                              OneOfValues_Y01 + i + offset
#endif
                              )) return 0;
        }
        offset += Y_01_Length[j];
        fclose(fileHandles[j]);
    }

    if (!(Y_02 = (int **) UTL_MEM_ALLOC(sizeof(int *) * nY_02)))
        goto UnableToAllocateMemory;
    if (!(cY_02 = (int *) UTL_MEM_ALLOC(sizeof(int) * nY_02)))
        goto UnableToAllocateMemory;
    if (!(iY_02 = (int *) UTL_MEM_ALLOC(sizeof(int) * nY_02)))
        goto UnableToAllocateMemory;
    if (!(X_02 = (unsigned char **)
                 UTL_MEM_ALLOC(sizeof(unsigned char *) * nY_02)))
        goto UnableToAllocateMemory;
    if (!(iX_02 = (double *) UTL_MEM_ALLOC(sizeof(double) * nY_02)))
        goto UnableToAllocateMemory;

#ifdef OBSOLETE_IS_OK
    if ( NumRangeFields )
    {
        if (!(RangeValues_Y02 = (float **) UTL_MEM_ALLOC(sizeof(float *) *
                                                         nY_02)))
            goto UnableToAllocateMemory;
    }
    if ( NumOneOfFields )
    {
        if (!(OneOfValues_Y02 = (int **) UTL_MEM_ALLOC(sizeof(int *) *
                                                       nY_02)))
            goto UnableToAllocateMemory;
    }
#endif

    offset = 0;
    for ( j = 0; j < TotalInputs; j++ )
    {
        if ( !(fileHandles[j] = fopen(X2file[j],"r")) )
            goto UnableToAllocateMemory;
        for (i = 0; i < Y_02_Length[j]; i++)
        {
            if (!GetNextLine( fileHandles[j],
                              fingerFP[j],
                              cY_02+i+offset,
                              Y_02+i+offset,
                              X_02+i+offset,
                              1
#ifdef OBSOLETE_IS_OK
                              , RangeValues_Y02 + i + offset,
                              OneOfValues_Y02 + i + offset
#endif
                              )) return 0;
        }
        offset += Y_02_Length[j];
        fclose(fileHandles[j]);
    }
    if (!Good_1) /* note: Good_1 is never used but triggers other allocations */
    {
        /* bitset sizes are rounded up to whole 32-bit words, in bytes */
        i = (nY_01+31)/32 * 4;
        if (!(Good_1 = (int *) UTL_MEM_ALLOC(i))) return 0; memset( Good_1,0,i);
        if (!(Dead_1 = (int *) UTL_MEM_ALLOC(i))) return 0; memset( Dead_1,0,i);
        i = (nY_02+31)/32 * 4;
        if (!(Good_2 = (int *) UTL_MEM_ALLOC(i))) return 0; memset( Good_2,0,i);
        if (!(Dead_2 = (int *) UTL_MEM_ALLOC(i))) return 0; memset( Dead_2,0,i);

        for ( size = 0, j = 0; j < TotalInputs; j++ )
        {   BitOffsets[j] = size;
            size += ( Y_01_Length[j] * Y_02_Length[j] );
        }
        Pro_size = size = ( size + 31 )/32 * 4;
        if (!(Good_Products = (int *) UTL_MEM_ALLOC(size))) return 0;
        memset( Good_Products,0,size);
        if (!(Dead_Products = (int *) UTL_MEM_ALLOC(size))) return 0;
        memset( Dead_Products,0,size);
        if ( !( MasterFile || MasterFileList ) ) /* gather the dead together.... */
            Mortuary(bitset, TotalInputs, Dead_Products, size, BitOffsets);
    }

    offset = SomeLeft = 0;
    /*
    ** Figure out the number of products for each input set and the total
    ** number of products.
    */
    for ( j = 0; j < TotalInputs; j++ )
    {
        BitMapStartPoint[j] = offset;
        offset += Y_01_Length[j] * Y_02_Length[j];
        SomeLeft += RemainingInput[j];
    }
    TotalProducts = offset;
#ifdef OBSOLETE_IS_OK
    /*
    ** Initialize the needed structures to pass around.
    */
    RangeValuesData.numRangeFields = NumRangeFields;
    RangeValuesData.rangeValues_Y01 = RangeValues_Y01;
    RangeValuesData.rangeValues_Y02 = RangeValues_Y02;
    RangeValuesData.rangeFields = RangeFields;
    OneOfValuesData.numOneOfFields = NumOneOfFields;
    OneOfValuesData.oneOfValues_Y01 = OneOfValues_Y01;
    OneOfValuesData.oneOfValues_Y02 = OneOfValues_Y02;
    OneOfValuesData.oneOfFields = OneOfValues;
#endif
    InputData.totalInputs = TotalInputs;
    InputData.Y_01_Length = Y_01_Length;
    InputData.Y_02_Length = Y_02_Length;
    /*
    ** Read in the -rangevar values if they are present in the csln file.
    */
#ifdef OBSOLETE_IS_OK
    if ( !ReadRangeVarFromCoreFiles(TotalInputs,
                                    CoreFileNames,
                                    CoreStart,
                                    NumRangeFields,
                                    RangeFields) )
        return 0;
#endif
    return 1;
UnableToOpenX1File :
    fprintf(stderr, "Unable to open reagent file\n");
    goto AddTraceback;
UnableToAllocateMemory :
    fprintf(stderr,"Unable to allocate memory\n");
    goto AddTraceback;
UnableToReadBitset :
    fprintf(stderr,"Unable to Read bitset file\n");
    goto AddTraceback;
UnableToReadMaster :
    fprintf(stderr,"Unable to Read master file\n");
    goto AddTraceback;
AddTraceback :
    return 0;
}
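ReadEverything lays the products of several input sets end to end in one bitset: BitMapStartPoint[k] is the first bit of input k, and product (i, j) of that input sits at BitMapStartPoint[k] + i*Y_02_Length[k] + j. A minimal sketch of that bookkeeping, including the rounding of the bitset size up to whole 32-bit words, is shown below. The function names fill_start_points, product_position and bitset_bytes are hypothetical.

```c
#include <assert.h>

/* Record the first bit of each input's block of n1[k] * n2[k] products.
 * Returns the total number of products over all inputs. */
static int fill_start_points(const int *n1, const int *n2, int ninputs,
                             int *start)
{
    int k, offset = 0;
    for (k = 0; k < ninputs; k++) {
        start[k] = offset;
        offset += n1[k] * n2[k];
    }
    return offset;
}

/* Bit position of product (i, j) of input k in the combined bitset. */
static int product_position(const int *start, const int *n2,
                            int k, int i, int j)
{
    return start[k] + i * n2[k] + j;
}

/* Bytes needed for nbits bits, rounded up to whole 32-bit words. */
static int bitset_bytes(int nbits)
{
    return (nbits + 31) / 32 * 4;
}
```

With two inputs of 2 x 4 and 3 x 5 reagents, the second block starts at bit 8 and the combined bitset holds 23 products in 4 bytes.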

/* concatenate a series of compressed bitsets into one big raw bitset
   -> AND <- destroy those compressed bitsets */
int Mortuary(void *bitset[], int nsets, int *rawbits, int byte_size, int *offset)
{
    int i;

    for (i = 0; i < nsets; i++)
    {   CS_PRDCT_BITSET_CONCAT_RAW( bitset[i], rawbits, offset[i], 0);
        CS_PRDCT_BITSET_DESTROY_BIT_STRING(bitset[i]);
        bitset[i] = NULL;
    }

    not_here( rawbits, byte_size );
}

static int ParseArguments( argc, argv )
/*+!
 * This function parses the command line arguments.
 *
 * Returns: 1 on a successful command line parse, 0 otherwise.
 *
 * Warnings:
 *
 * Errors:
 *
 * See Also:
 *
 * Author        Date      Description
 * G. B. Smith   02-09-93  Original Version
 */
int argc;
char **argv;
{
    int nargs,
        noptions = sizeof( Options )/sizeof(Options[0]);
    OutputFile = stdout;

    nargs = UTL_PARSE_OPT( argc, argv, noptions, Options );
    if ( !nargs ) goto SyntaxError;

    if (WhatFirst)
    {   if (strstr(WhatFirst,"R1")) WhatFirst[0]='1';
        if (strstr(WhatFirst,"R2")) WhatFirst[0]='2';
    } else {
        WhatFirst = UTL_MEM_ALLOC(2); WhatFirst[0]='0'; }

#ifdef OBSOLETE_IS_OK
if ( RangeVar && ! 

ParseRangeVar(RangeVar,&NumRangeFieldsAllocated,&NumRangeFields,&I^g^^ 
goto SyntaxError ; 
30 if ( OneOfVar && 

!ParseOneOfVar(OneOfVar,&NumOneOfFieldsAllocated,&NumOneOfFields,& 
s)) 

goto SyntaxError ; 
#endif
    return 1;
SyntaxError :
    return 0;
}

static int OpenOutputFile()
/*
 * Returns: 1 on success, else 0
 */
{
    char *msg;

    OutputFile = stdout;
    if( OutputFileName )
    {
        /*
        ** We need to create output files under the ownership of the REAL user not the
        ** EFFECTIVE user. This only applies if setuid options are activated.
        */
        {
            struct stat statBuff ;
            int uid ;
            int euid ;

            uid = getuid() ;
            euid = geteuid();
            stat(OutputFileName, &statBuff);
            /*
            ** There are two cases
            ** (1) the file to output to exists
            **     Use the ownership of the current owner of the file or if you cant do that
            **     do not do anything.
            ** (2) The file is being created.
            **     use the ownership of the REAL user.
            */
            if ( access(OutputFileName, F_OK) == 0 )
            {   /* If the file exist and the real user is the owner of the file */
                if ( statBuff.st_uid == uid )
                    seteuid(uid);
            }
            else
            {   /* Create the file as the REAL user */
                seteuid(uid);
            }
        }
        OutputFile = fopen( OutputFileName, "wb");
        if( !OutputFile ) {
            fprintf(stderr,"Error: Failed to open output file \"%s\"\n",
                    OutputFileName );
            goto ErrorReturn;
        }
    }
    return 1;
ErrorReturn:
    return 0;
}
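The ownership rule described in the comment of OpenOutputFile can be looked at in isolation. The following is a minimal sketch, assuming a POSIX system; the helper name `open_as_real_user` is hypothetical and does not appear in the original listing. Unlike the original, it checks the result of `seteuid` and gives up if the switch fails.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper (not part of the original program): open `path`
 * for writing so that a newly created file belongs to the REAL user,
 * even when the program is running setuid. Returns NULL on failure. */
static FILE *open_as_real_user(const char *path)
{
    struct stat st;
    uid_t uid = getuid();               /* real user id */

    if (access(path, F_OK) == 0) {
        /* Case 1: the file exists. Switch identity only if the real
         * user already owns it; otherwise leave the identity alone. */
        if (stat(path, &st) == 0 && st.st_uid == uid) {
            if (seteuid(uid) != 0)
                return NULL;
        }
    } else {
        /* Case 2: the file is being created. Create it as the real user. */
        if (seteuid(uid) != 0)
            return NULL;
    }
    return fopen(path, "wb");
}
```

A caller would use it as `FILE *out = open_as_real_user(name);` and fall back to an error path when the result is NULL.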



static CloseOutputFile()
/*+!
 *
 * This function closes the output file. It is included just for cleanliness.
 *
 * Author        Date      Description
 * G. B. Smith   02-09-93  Original Version
 *
 */
{
    fclose( OutputFile );
}

CheckPointProgram(programName)
char *programName ;
{
    int sizes[2] ;
    int allocSizes[2] ;
    int numInSites[2] ;
    char hold[81] ;
    int i ;
    void *compressed ;
    int total ;

    for ( i = 0 ; i < TotalInputs ; i++ )
    {
        sizes[0] = Y_01_Length[i] ;
        sizes[1] = Y_02_Length[i] ;
        numInSites[0] = numInSites[1] = -1 ;
        allocSizes[0] = allocSizes[1] = -1 ;
        /*
        ** Lets get a compressed version of the dead products before we write it out
        ** to file.
        */
        compressed = CS_PRDCT_BITSET_CREATE_BIT_STRING(
                         Dead_Products,
                         BitMapStartPoint[i],
                         2,
                         sizes,
                         sizes,
                         &total);
        WriteOutCheckPointFile(OutputCheckpointNames[i],
                         ( MasterFile || MasterFilelist ) ? InputNames[i]
                                                          : MasterFile_Bitset[i],
                         ( MasterFile || MasterFilelist ) ? InputStartRec[i]
                                                          : StartRec_Bitset[i],
                         programName,
                         Good_Products,
                         BitMapStartPoint[i],
                         sizes,
                         allocSizes,
                         Selections[i],
                         numInSites,
                         total,
                         compressed);
        CS_PRDCT_BITSET_DESTROY_BIT_STRING(compressed);
    }
}



int main( argc, argv )
/*+E
 *
 */
int argc;
char **argv;
{
    long startTime,
         totalTime,
         finishTime;
    int numFiltered ;
    int numEliminated ;
    int tmp ;
    char comline[2048];

    /***
    *** Establish handler for a user interrupt.
    ***/
    signal( SIGINT, UserHitControlC);
#ifdef SIGHUP
    signal( SIGHUP, UserHitControlC);
#endif
    if( !ParseArguments( argc, argv ) )
        goto SyntaxError;
    if( !OpenOutputFile() ) goto FailureExit;
    /* if (!RestartState()) goto FailureExit; */
    time( &startTime );
    Visual((stderr,"Begin reading files: %s",ctime(&startTime)));
    /* Let's actually do something now */
    if (!ReadEverything(NoMorehitsPlease)) goto FailureExit;
    time( &finishTime );
    Visual((stderr,"Begin filtering: %s",ctime(&finishTime)));
    if (!FilterProducts(&InputData,&RangeValuesData,&OneOfValuesData,&numFiltered,
                        IsItMissingAValue))
        goto FailureExit;
    CurrentInput = 0 ;
    time( &finishTime );
    Visual((stderr,"Filtered out %d out of %d possible products\n",numFiltered,
            TotalProducts ));
#if 0
    /*
    ** Now see if there are any hitlists or databases that you should filter
    ** for.
    */
    time( &finishTime );
    Visual((stderr,"Begin eliminating selections in Unity database %s",
            ctime(&finishTime)));
    if ( !EliminateProductsFromDatabase(DatabaseNames,
                                        &numEliminated,
                                        ZapAllNeighbors))
        goto FailureExit;
    time( &finishTime );
    time( &finishTime );
    Visual((stderr,"Begin eliminating selections in Unity hitlist %s",
            ctime(&finishTime)));
    if ( !EliminateProductsFromHitlist(HitlistNames,
                                       ScreenFileName?ScreenFileName:DefaultScreenFileName,
                                       &tmp,
                                       ZapAllNeighbors))
        goto FailureExit;
    time( &finishTime );
    Visual((stderr,"Eliminated %d out of %d possible products\n",numEliminated+tmp,
            TotalProducts ));
#endif
    Visual((stderr,"Begin selection: %s",ctime(&finishTime)));
    if (!UserAborted &&
        !SelectEverything(InputSource,
                          NoMorehitsPlease,
                          WhatFirst,
                          CalcualteProductFingurePrint,
                          ActuallyCompute,
                          ZapAllNeighbors)) goto FailureExit;
    CloseOutputFile();
    time( &finishTime );
    totalTime = finishTime - startTime;
    if( !totalTime ) totalTime = 1;
    Visual((stderr,"Created %d Selections in ", nProcessed ));
    Visual((stderr,"%d Hours, %d min, %d secs\n",
            totalTime/(60*60),
            (totalTime%(60*60))/60,
            (totalTime%60)));
    Visual((stderr,"Each comparison required %.8f seconds to calculate\n",
            (totalTime/((double)(nProcessed?nProcessed:1)))));
    Visual((stderr,"End Quick Select Computation: %s",ctime(&finishTime)));
    MakeComLine(comline, 2048, argc, argv);
    CheckPointProgram(comline);
    UserAborted ? exit(ErrorExit) : exit(GoodExit);
SyntaxError:
    exit(1);
FailureExit:
    exit(ErrorExit);
}

Appendix "L"

/* dbcsln_both */
/* differs from dbcsln_design ONLY in combining topomers and fp as 1 metric */
/* combined as = (Ratio * CoMFA)^2 + (1-Tanimoto)^2 */
/*
 * This program evaluates (approximate) Topomer+Tanimoto similarity vs cSLNs
 * based on preprocessing of the substituent reagents. Using this, it
 * selects a diverse set of products while trying to maximize use of
 * some groups. Diversity is achieved by zapping all neighbors after each
 * new selection, so that any non-zapped product can freely be selected.
 *
 * To be added: restart capability and reagent blackout
 * (i.e. to recomplete an earlier design and/or to remove
 * all occurrences of Y_01 = 37 and so on when they
 * prove to be unavailable or otherwise unsuitable).
 *
 * Limitations: currently exactly 2 R groups are assumed. Need to extend
 * to more than 2 and to handle X groups.
 *
 * The OBSOLETE file contains one line per hit, of the form
 *     Y1 Y2
 * where Y1 = index of the substituent in X1.pro file
 *       Y2 = index of the substituent in X2.pro file
 *
 * The REAL output is a ChemSpace bitset file.
 *
 * Options: Look at the array Options below.
 */

#include <stdio.h>
#include <signal.h>

<%idude <c^.h> 
#include <unistd.h>
#include <string.h>
#include <sys/stat.h>
#include <math.h>
#include "parseopt.h"
#include "utl_str.h"
#include "utl_mem.h"
#include "utl_file.h"
#include "utl_math.h"
#include "ct.h"
#include "ct_expr.h"
#include "ct_proto.h"
#include "import_proto.h"
#include "io_fprint.h"
#include "commonData.h" /* Globals used by most functions, we will clean this
                           up soon */
#include "dbcsln_bs_proto.h"
#include "dbcsln_hlm_proto.h"
#define OBSOLETE_IS_OK 1
FILE *debugFile = (FILE *) NULL ;



30 



#ifdef OBSOLETE_IS_OK
/* these sections retain the filtering capabilities now also present
   in db_filter.c - at some point they should exist ONLY in db_filter.
*/
static struct RangeInfoStruct RangeValuesData ;
static struct OneOfInfoStruct OneOfValuesData ;
static struct InputInfoStruct InputData ;
static int NumRangeFields ;
static int NumRangeFieldsAllocated ;
static RangeStruct *RangeFields ;
static int NumOneOfFieldsAllocated ;
static int NumOneOfFields ;
static OneOfStruct *OneOfValues ;
static float **RangeValues_Y01 ; /* Actual values read in from nnn.X1 file.
                                    If MW is the first and logp is the second value
                                    specified on the -rangevar argument list then
                                    RangeValues_Y01[n][0] would keep the value for MW
                                    for the nth line in the nnn.X1 file and
                                    RangeValues_Y01[n][1] would keep the value for
                                    logp for that line */
static float **RangeValues_Y02 ; /* same */
static int **OneOfValues_Y01 ;   /* Actual values read from nnn.X1 files but translated
                                    into an index of OneOfValues[i].values so
                                    we dont have to waste memory and time doing strcmp */
static int **OneOfValues_Y02 ;   /* Same */
#endif
static char *MasterFile ;
static char *MasterRecord ;
static FILE *MasterFile_File;
static char *FngrFile;
static int FingerCore_Card;
static int *FingerCore_FP;
static char *RangeVar ;
static char *OneOfVar ;
static double Ratio = 0.003 ;
static int    WordsPerFingerprint = 0;
static int    BytesPerFingerPrint = 0;
static int    NoMorehitsPlease = 999999999;
static int    DebugLevel;
static int    UserAborted;
static char   *OutputFileName;
static char   *CheckPointFileName;
static char   *WhatFirst;
static char   *InputSource = 0;
static char   *BitsetSource = 0;
static char   *DatabaseNames = (char *)0 ;
static char   *HitlistNames = (char *)0 ;
static int    BitOffsets[MAX_INPUT_CSLNS]; /* why recompute? */
static int    CoreSym[MAX_INPUT_CSLNS];
static double *jX_01, *jX_02;
int           TotalProducts ;
static int    Pro_size;

static struct ParseOptions Options[] = {
/***
 *** DO NOT MOVE ENTRIES IN THIS TABLE. ADD ENTRIES ONLY AT THE END.
 ***/
    {"master", ParseOptString, &MasterFile,
        "Name is the file with master file records" },
    {"index", ParseOptString, &MasterRecord,
        "Which MasterRecord or Bitset entry 1-n" },
    {"comfa", ParseOptDouble, &Ratio ,
        "Weighting for CoMFA fields (0.003)" },
    {"distance", ParseOptDouble, &Distance,
        "Weighted neighborhood distance (0.240)" },
    {"maxhits", ParseOptInt, &NoMorehitsPlease,
        "Maximum number of hits before stopping" },
    {"bitset", ParseOptString, &BitsetSource,
        "Bitset file to start from"},
    {"output", ParseOptString, &OutputFileName,
        "File to which hit info will be written." },
    {"checkpoint", ParseOptString, &CheckPointFileName,
        "File to which bitset info will be written." },
    {"prefer", ParseOptString, &WhatFirst,
        "One of R1, R2 to maximize us of."},
    {"debug", ParseOptBoolean, &DebugLevel,
        "Use +debug to enable debugging messages" },
#ifdef OBSOLETE_IS_OK
    {"rangevar", ParseOptString, &RangeVar,
        "Scalar field name and range to filter out, i.e. logp -1.0 8.0 MW 200 500 price 0 12.50" },
    {"oneof", ParseOptString, &OneOfVar,
        "Field name and list of values that the product should match\n, i.e. supplier Aldrich,Sigma,Fluka,SALOR taste SWEET,Salty" },
#endif
    {"database", ParseOptString, &DatabaseNames,
        "Unity database to use to exclude possible products" },
    {"hitlist", ParseOptString, &HitlistNames,
        "Unity hitlist to use to exclude possible products" },
};

static int WarmUp()
{
    int i;

    for (i=0;i<65536;i++) BigBits[i] = (i&1) + (i&2)/2 + (i&4)/4 + (i&8)/8 +
            (i&16)/16 + (i&32)/32 + (i&64)/64 + (i&128)/128
            + (i&256)/256 + (i&512)/512 + (i&1024)/1024
            + (i&2048)/2048
            + (i&4096)/4096 + (i&8192)/8192 + (i&16384)/16384
            + (i&32768)/32768 ;
    setbits_nbits_Init();
    return 1;
}
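WarmUp fills a 65536-entry table with the number of set bits in every 16-bit value, so that later code can count bits one 16-bit word at a time. The following is a minimal sketch of the same idea. The names `bit_count16`, `init_bit_count16` and `count_bits` are hypothetical stand-ins, and the table is built with a short recurrence rather than the term-by-term sum used in the listing.

```c
#include <stddef.h>

/* Hypothetical stand-in for the program's BigBits table. */
static unsigned char bit_count16[65536];

/* Fill the table so that bit_count16[v] is the number of 1 bits in v. */
static void init_bit_count16(void)
{
    size_t v;

    bit_count16[0] = 0;
    for (v = 1; v < 65536; v++) {
        /* bits in v = bits in (v without its lowest bit) + lowest bit */
        bit_count16[v] = (unsigned char)(bit_count16[v >> 1] + (v & 1));
    }
}

/* Count the set bits in an array of 16-bit words by table lookup. */
static int count_bits(const unsigned short *words, size_t n)
{
    int total = 0;
    size_t i;

    for (i = 0; i < n; i++)
        total += bit_count16[words[i] & 0xFFFFu];
    return total;
}
```

The table costs 64 KB once and replaces a per-bit loop with a single array access per word, which matters when the count sits in the innermost comparison loop.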

static int WhatsTheDifference()
{
    int i, j;
#define pow2(a) ( (a) * (a) )
    /* the assignment of codes is based on the following (from gen_pls.c):
       static float cutoff[16] = {9999., 0., 2., 4., 6., 8., 10., 12.,
                                  14., 16., 18., 20., 22., 24., 26., 30. };
    */
    boundary[0] = 9999.; /* missing data ought never to occur. */
    boundary[1] = -0.1 * Ratio;
    for (i=2;i<15;i++)
        boundary[i] = (2*i-3) * Ratio;
    boundary[15] = 30.0 * Ratio; /* this is a step curve with a cutoff at 30! */
    for (i=0;i<16;i++) for (j=0;j<16;j++)
        Dist[i][j] = pow2( boundary[i] - boundary[j]);
    Distance *= Distance; /* want to test D^2 directly */
    return 1;
}

static int CalcualteProductFingurePrint(product,firstPart,secondPart)
int *product ;
int *firstPart ;
int *secondPart ;
{
    int index ;
    int totalBitsSet = 0 ;
    unsigned char *prod , *y01, *y02 ;

    prod = ( unsigned char *)product ;
    y01 = ( unsigned char *)firstPart ;
    y02 = ( unsigned char *)secondPart ;
    for (index=0;index < BytesPerFingerPrint;index++,prod++)
    {
        *prod = *y01++ | *y02++ ;
        totalBitsSet += nbits[*prod & 255];
    }
    return totalBitsSet ;
}



static int IntersectQuery( pIntr, pFP, pXntr, pXP, xuery, index,
                           symmetric, yuery, pXntr2)
int *pIntr, **pFP;
double *pXntr, *pXntr2;
unsigned char **pXP, **xuery, **yuery;
int index, symmetric;
{
    unsigned char *ptr, *qtr;
    int i, count;
    double xount;

    if (!(*pFP) || !(*pXP))
        return 1 ;
    ptr = (unsigned char *) *pFP;
    qtr = (unsigned char *) query;
    for (count=0, i=0; i < WordsPerFingerprint*4; i++)
        count += nbits[ *ptr++ & *qtr++ ];
    *pIntr = count;
    if ( xuery )
    {
        ptr = (unsigned char *) *pXP;
        qtr = (unsigned char *) *xuery;
        for (xount=0.0, i=0; i<XytesPerFingerPrint[index]; i++, ptr++, qtr++)
            xount += Dist[ *ptr & 0x0F ][ *qtr & 0x0F ]
                   + Dist[ (*ptr & 0xF0) >> 4][ (*qtr & 0xF0) >> 4] ;
        *pXntr = xount;
        if ( !symmetric) return 1;
        ptr = (unsigned char *) *pXP;
        qtr = (unsigned char *) *yuery;
        for (xount=0.0, i=0; i<XytesPerFingerPrint[index]; i++, ptr++, qtr++)
            xount += Dist[ *ptr & 0x0F ][ *qtr & 0x0F ]
                   + Dist[ (*ptr & 0xF0) >> 4][ (*qtr & 0xF0) >> 4] ;
        *pXntr2 = xount ;
    }
    return 1;
}
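The field part of the comparison works on strings of 4-bit codes, two per byte, and looks up each pair of codes in a 16 x 16 table of squared differences built by WhatsTheDifference. The following is a minimal sketch of that lookup, assuming (as the listing suggests) that each code stands for a representative value scaled by the weighting `Ratio`. The names `code_value`, `code_dist2`, `init_code_dist2` and `packed_dist2` are hypothetical stand-ins for `boundary`, `Dist` and the loops above.

```c
/* Hypothetical stand-ins for the program's boundary[] and Dist[][] globals. */
static double code_value[16];
static double code_dist2[16][16];

/* Build the table of squared differences between code values. */
static void init_code_dist2(double ratio)
{
    int i, j;

    code_value[0] = 9999.0;                  /* missing data */
    code_value[1] = -0.1 * ratio;
    for (i = 2; i < 15; i++)
        code_value[i] = (2 * i - 3) * ratio; /* 1, 3, 5, ..., 25 times ratio */
    code_value[15] = 30.0 * ratio;           /* everything beyond the cutoff */

    for (i = 0; i < 16; i++)
        for (j = 0; j < 16; j++) {
            double d = code_value[i] - code_value[j];
            code_dist2[i][j] = d * d;
        }
}

/* Sum of squared differences between two nibble-packed code strings.
 * Each byte carries two codes: the low nibble and the high nibble. */
static double packed_dist2(const unsigned char *a, const unsigned char *b,
                           int nbytes)
{
    double sum = 0.0;
    int i;

    for (i = 0; i < nbytes; i++)
        sum += code_dist2[a[i] & 0x0F][b[i] & 0x0F]
             + code_dist2[(a[i] & 0xF0) >> 4][(b[i] & 0xF0) >> 4];
    return sum;
}
```

Because the table is precomputed, comparing two reagents costs two lookups and two additions per byte, with no floating-point multiplication in the inner loop.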

static int ActuallyCompute( index1, index2, pUnion, pIntersection, pMaxTan, currentInput)
int index1, index2, *pUnion, *pIntersection;
double *pMaxTan;
{
    int i;
    unsigned short *h1, *h2, *hquery, product;
    int numberOfMissingBits ;

    if ( currentInput == -1 )
        numberOfMissingBits = NumMissingBits[0] ;
    else
        numberOfMissingBits = NumMissingBits[currentInput] ;
    h1 = (unsigned short *) Y_01[index1];
    h2 = (unsigned short *) Y_02[index2];
    hquery = (unsigned short *) query;
    *pUnion = *pIntersection = 0;
    for( i=0; i<WordsPerFingerprint*2 ;i++,h1++,h2++,hquery++)
    {
        /* product = (*h1 | *h2) ;*/
        *pUnion += BigBits[ (*h1 | *h2) | *hquery];
        *pIntersection += BigBits[ (*h1 | *h2) & *hquery];
    }
    *pMaxTan = (double) (*pIntersection + numberOfMissingBits )/ (double) *pUnion;
    return 1;
}
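ActuallyCompute approximates the fingerprint of a product as the bitwise OR of its two reagent fingerprints and then forms a Tanimoto coefficient against the query. The following is a minimal sketch of that calculation, assuming 16-bit fingerprint words. The function names are hypothetical, and the sketch leaves out the `numberOfMissingBits` correction that the listing adds to the intersection count.

```c
/* Number of set bits in the low 16 bits of v. */
static int popcount16(unsigned v)
{
    int n = 0;

    for (v &= 0xFFFFu; v != 0; v >>= 1)
        n += (int)(v & 1u);
    return n;
}

/* Tanimoto coefficient between the approximate product fingerprint
 * (r1 OR r2) and a query fingerprint, each nwords 16-bit words long.
 * Returns 0.0 when no bits are set anywhere. */
static double product_tanimoto(const unsigned short *r1,
                               const unsigned short *r2,
                               const unsigned short *query,
                               int nwords)
{
    int i, uni = 0, inter = 0;

    for (i = 0; i < nwords; i++) {
        unsigned p = (unsigned)(r1[i] | r2[i]);   /* product fingerprint */
        uni   += popcount16(p | query[i]);
        inter += popcount16(p & query[i]);
    }
    return uni ? (double)inter / (double)uni : 0.0;
}
```

Working from the two reagent fingerprints avoids storing a fingerprint for every product, at the price of the approximation noted in the program header.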

static int
ZapAllNeighbors(thisQuery,thisC_Query,numZapped,doCTOPS,currentInput
                )
int *thisQuery ;
int thisC_Query ;
int *numZapped ;
int doCTOPS ;
int currentInput ;
{
    int cqt, q_lo, q_hi, i, j, carhold, inthold, onion, intsc ;
    double max, test, test2;
    int k ;
    int Y_01_Offset, Y_02_Offset ;
    int pos ;
    int numberOfMissingBits ;

    if (DebugLevel == 69)
        printf(" time to zap %d - %d\n", index1+1, index2+1);
    if ( currentInput == -1 )
        numberOfMissingBits = NumMissingBits[0] ;
    else
        numberOfMissingBits = NumMissingBits[currentInput] ;
    if ( thisQuery )
    {
        memcpy(query,thisQuery,BytesPerFingerPrint) ;
        c_query = thisC_Query ;
    }
    *numZapped = 0 ;
    Y_01_Offset = Y_02_Offset = 0 ;
    for ( k = 0 ; k < CurrentInput ; k++ )
    {
        Y_01_Offset += Y_01_Length[k];
        Y_02_Offset += Y_02_Length[k];
    }
    for (i=0;i<nY_01;i++)
        if (! IntersectQuery( iY_01+i,
                              Y_01+i,
                              iX_01+i,
                              X_01+i,
                              (doCTOPS)?X_01 + index1 + Y_01_Offset : NULL ,
                              0,
                              CoreSym[CurrentInput] ,
                              (doCTOPS)?X_02 + index2 + Y_02_Offset : NULL , jX_01+i))
            return 0;
    for (i=0;i<nY_02;i++)
        if (! IntersectQuery( iY_02+i,
                              Y_02+i,
                              iX_02+i,
                              X_02+i,
                              (doCTOPS)?X_02 + index2 + Y_02_Offset : NULL ,
                              1,
                              CoreSym[CurrentInput],
                              (doCTOPS)?X_01 + index1 + Y_01_Offset : NULL, jX_02+i))
            return 0;
    /* now zap topomer neighbors */
    /*
    ** Only do topomer neighbors if CTOPS was present in the input.
    */
    if ( doCTOPS )
    {
        Y_01_Offset = Y_02_Offset = 0 ;
        for ( k = 0 ; k < TotalInputs ; k++ )
        {
            for (i=0;i<Y_01_Length[k];i++)
            {
                if ( !CoreSym[CurrentInput] && (iX_01[i+Y_01_Offset] >
                                                Distance) )
                    continue;
                for (j=0;j<Y_02_Length[k];j++)
                {
                    if (UserAborted)
                        return 1;
                    switch (CoreSym[CurrentInput])
                    { case 0: if ( iX_02[j+Y_02_Offset] > Distance) continue;
                              test = iX_01[i+Y_01_Offset] +
                                     iX_02[j+Y_02_Offset];
                              break;
                      case 1:
                              test = iX_01[i+Y_01_Offset] + iX_02[j+Y_02_Offset];
                              test2= jX_01[i+Y_01_Offset] +
                                     jX_02[j+Y_02_Offset];
                              if (test2 < test) test=test2;
                              break;
                    }
                    if ( test <= Distance &&
                         !TestBit(Dead_Products,
                                  BitMapStartPoint[k] + i
                                  *Y_02_Length[k] +j) &&
                         Am_I_Close(i + Y_01_Offset, j + Y_02_Offset , test
                                    ,currentInput) )
                    {
                        if (DebugLevel == 69)
                            printf("Distance kill %d %d - %f , %f + %f OR %f , %f + %f\n",
                                   i+1,j+1, iX_01[i] + iX_02[j], iX_01[i] , iX_02[j],
                                   jX_01[i] + jX_02[j], jX_01[i] , jX_02[j] );
                        pos = BitMapStartPoint[k] + i *Y_02_Length[k] +j ;
                        FlagProduct(Dead_Products, 0.0, pos );
                        SomeLeft-- ;
                        RemainingInput[k]-- ;
                        (*numZapped)++ ;
                    }
                } /* Y_02 loop */
            } /* Y_01 loop */
            Y_01_Offset += Y_01_Length[k] ;
            Y_02_Offset += Y_02_Length[k] ;
        }
    }
}
static int Am_I_Close(i,j,test,currentInput)
int i,j;
double test;
int currentInput ;
{
    int onion, intsc;
    double max, Tanimoto;

    Tanimoto = 1. - sqrt( Distance - test);
    ActuallyCompute( i, j, &onion, &intsc, &max, currentInput);
    return( max >= Tanimoto);
}
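Am_I_Close applies the combined metric stated in the program header, (Ratio * CoMFA)^2 + (1 - Tanimoto)^2, by solving for the smallest Tanimoto value that still keeps a product inside the neighborhood radius. The following is a minimal sketch of the same decision written directly in the squared form, which needs no square root. The name `is_neighbor` and its parameters are hypothetical; `d2_field` plays the role of `test` (the already weighted and squared field term) and `radius` is the unsquared neighborhood distance.

```c
/* Decide whether a product lies within `radius` of the query under the
 * combined metric  d2_field + (1 - tanimoto)^2 <= radius^2.
 * For tanimoto <= 1 this is the same condition as
 * tanimoto >= 1 - sqrt(radius^2 - d2_field), the form used in the listing. */
static int is_neighbor(double d2_field, double tanimoto, double radius)
{
    double d = 1.0 - tanimoto;          /* fingerprint dissimilarity */

    return d2_field + d * d <= radius * radius;
}
```

With the default radius of 0.240 and no field difference, a product is a neighbor when its Tanimoto similarity is at least 0.76; any field difference raises that threshold.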
/*
**
** Abstract : Function zaps products that are missing CTOPS or FP fields.
**
** Usage :
**
** Returns : 1 if the data value is missing or zero if the values exist
**
** Algorithms : None.
**
** Revision History :
**
**      Author              Date        Description
**      Fred Soltanshahi    05/21/96    Original version.
**
*/
static int IsItMissingAValue(index1,index2,currentInput)
int index1 ;
int index2 ;
{
    int Y_01_Offset = 0 ;
    int Y_02_Offset = 0 ;
    int k ;

    for ( k = 0 ; k < currentInput ; k++ )
    {
        Y_01_Offset += Y_01_Length[k] ;
        Y_02_Offset += Y_02_Length[k] ;
    }
    if ( ( Y_01[index1+Y_01_Offset] == NULL ) ||
         ( Y_02[index2+Y_02_Offset] == NULL ) ||
         ( X_01[index1+Y_01_Offset] == NULL ) ||
         ( X_02[index2+Y_02_Offset] == NULL ) )
    {
        return 1 ;
    }
    return 0 ;
}

static int GetNextLine( FILE *filePointer, FILE *fingerfp, int *pCard, int **pFP,
                        unsigned char **pXP, int index
#ifdef OBSOLETE_IS_OK
                        , float **rangeValues, int **oneOfValues
#endif
                        )
{
    char *line, *fpcard, *fp, *CTOPS ;
    int words, hold;
    int pos ;

    if (-1 == UTL_SCAN_GETS( filePointer, "\\", &line))
        goto AddTraceback ;
#ifdef OBSOLETE_IS_OK
    ReadLineAttributes(line,
                       NumRangeFields,
                       rangeValues,
                       RangeFields,
                       NumOneOfFields,
                       oneOfValues,
                       OneOfValues) ;
#endif
    /* CTOPS = strstr(line,"CTOPS=")+strlen("CTOPS="); */
    CTOPS = strstr(line,"CTOPS=") ;
    if (!(*pFP = (int *) UTL_MEM_ALLOC( BytesPerFingerPrint)))
        goto AddTraceback ;
    if (!UTL_FILE_FREAD( pCard, sizeof(int), 1, fingerfp))
        goto AddTraceback ;
    if (!UTL_FILE_FREAD( *pFP, sizeof(int), WordsPerFingerprint, fingerfp))
        goto AddTraceback ;
    if ( CTOPS )
    {
        CTOPS += strlen("CTOPS");
        UTL_SCAN_TOKENIZE(CTOPS,';','\\');
        UTL_SCAN_TOKENIZE(CTOPS,'>','\\');
        words = strlen(CTOPS) / 2; /* must have 8 bit bytes */
        if (!XytesPerFingerPrint[index])
        {   XytesPerFingerPrint[index] = words;
        }
        if ( words != XytesPerFingerPrint[index]) goto MissingValue;
        *pXP = (unsigned char *) UTL_MEM_ALLOC(words);
        for (words=0;words<XytesPerFingerPrint[index];words++)
        {
            memcpy(next2,CTOPS,2);
            CTOPS += 2;
            sscanf(next2,"%2x", &hold);
            *(*pXP+words) = (unsigned char ) hold;
        }
    }
    return 1;
MissingValue :
    *pCard = 0 ;
    *pFP = (int *)NULL ;
    *pXP = (unsigned char *)NULL ;
    return 1 ;
AddTraceback :
    return 0 ;
}

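static int not_here( what, nbytes )
unsigned char *what;
int nbytes;
{
    /* invert every byte; the pointer is advanced in the loop header so that
       *what is not both read and modified in a single expression */
    for ( ; nbytes; --nbytes, what++) *what = ~ *what;
    return 1;
}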

/* this belongs in the utl module, actually */
int MakeComLine( char *line, int len, int argc, char **argv)
{
    int i;

    sprintf(line,"%s ",argv[0]);
    for (i=1;i<argc;i++)
    {
        line += strlen(line);
        sprintf(line,"%s ",argv[i]);
    }
}

static void UserHitControlC()
/*+!
 * This function is the signal handler for user initiated program termination.
 * Its only role is to set a flag indicating that the user wishes to abort the program.
 *
 * Author        Date      Description
 * G. B. Smith   02-09-93  Original Version
 */
{
    UserAborted = 1;
}

static int ReadEverything()
{
    char *hold;
    char buff[255];
    int i;
    int j;
    char *cp ;
    char *mp ;
    int offset ;
    int size ;
    char *input ;
    void *bitset[MAX_INPUT_CSLNS] ;
    /* because failure here means end program run, no effort to clean up
       memory on error is included. */
    offset=0;
    if ( !MasterFile && !BitsetSource )
    {
        fprintf(stderr,"An input file(master or bitset) must be specified\n");
        return 0 ;
    }
    if ( MasterFile && BitsetSource )
    {
        fprintf(stderr,"A design run can be run from either a master or a bitset file\n");
        return 0 ;
    }
    if (MasterFile && !MasterRecord )
    {
        fprintf(stderr,"A Bitset (or Master) record number must be specified\n");
        return 0 ;
    }
    if (!WarmUp() || !WhatsTheDifference()) return 0;
    /*
    ** Special case where we want to process all the records in the
    ** master file.
    */
    if ( atoi(MasterRecord) == -1 )
    {
        if ( ( TotalInputs = CountMasterRecords(MasterFile)) == 0 )
            goto UnableToReadMaster ;
        for ( i = 0 ; i < TotalInputs ; i++ )
        {
            InputNames[i] = UTL_STR_SAVE(MasterFile);
            InputStartRec[i] = i+1 ;
        }
    }
    else
    {
        if ( MasterFile )
            input = UTL_STR_SAVE(MasterFile);
        else
            input = UTL_STR_SAVE(BitsetSource);
        /*
        ** If there are more than one input file, process them all.
        */
        cp = strtok(input," ");
        while (cp)
        {
            InputNames[TotalInputs++] = UTL_STR_SAVE(cp);
            cp = strtok(NULL," ");
        }
        mp = strtok(MasterRecord," ");
        for ( i = 0 ; i < TotalInputs ; i++ )
        {
            /*
            ** If the user-specified record numbers for all the master files, then use them
            ** otherwise we will use the first record.
            */
            if (mp)
            {
                InputStartRec[i] = atoi(mp);
                mp = strtok(NULL," ");
            }
            else
                InputStartRec[i] = 1 ;
        }
    }



    mp = strtok(CheckPointFileName," ");
    for ( i = 0 ; i < TotalInputs ; i++ )
    {
        if ( mp )
        {
            OutputCheckpointNames[i] = UTL_STR_SAVE(mp);
            mp = strtok(NULL," ");
        }
        else
        {
            sprintf(buff,"%s_%d_chk.bs",basename(InputNames[i],NULL),i);
            OutputCheckpointNames[i] = UTL_STR_SAVE(buff);
        }
    }
    nY_01 = nY_02 = 0 ;
    if (TotalInputs > 1)
        fprintf(stderr,"All files assumed to be for same core\n");
    for ( i = 0 ; i < TotalInputs ; i++ )
    {
        if (MasterFile)
        {
            if ( !RetrieveMasterFile(InputNames[i],
                                     MasterFile_File,
                                     InputStartRec[i],
                                     &(NumMissingBits[i]),
                                     &(BitsInAbsentiaNoCount[i]),
                                     &(CoreFileNames[i]),
                                     &(CoreStart[i]),
                                     &FngrFile,
                                     &(X1file[i]),
                                     &(X2file[i]),
                                     &(Y_01_Length[i]),
                                     &(Y_02_Length[i]),
                                     &(fingerFP[i]),
                                     &(fingerOffsets[i]),
                                     &ScreenFileName,
                                     &BytesPerFingerPrint,
                                     &WordsPerFingerprint,
                                     &query,
                                     &FingerCore_FP,
                                     &FingerCore_Card) )
                goto UnableToReadMaster ;
            RemainingInput[i] = Y_01_Length[i] * Y_02_Length[i];
        }
        else
        {
            if( !(bitset[i] = CS_PRDCT_BITSET_OPEN(InputNames[i],
                                                   InputStartRec[i])) )
                goto UnableToReadBitset ;
            if ( !RetrieveMasterFileFromBitset(bitset[i],
                                     &(MasterFile_Bitset[i]) ,
                                     &(StartRec_Bitset[i]),
                                     &(NumMissingBits[i]),
                                     &(BitsInAbsentiaNoCount[i]),
                                     &(CoreFileNames[i]),
                                     &(CoreStart[i]),
                                     &FngrFile,
                                     &(X1file[i]),
                                     &(X2file[i]),
                                     &(Y_01_Length[i]),
                                     &(Y_02_Length[i]),
                                     &(fingerFP[i]),
                                     &(fingerOffsets[i]),
                                     &ScreenFileName,
                                     &BytesPerFingerPrint,
                                     &WordsPerFingerprint,
                                     &query,
                                     &FingerCore_FP,
                                     &FingerCore_Card) )
                goto UnableToReadBitset ;
            RemainingInput[i] = CS_PRDCT_BITSET_SELECTED(bitset[i]);
        }
        nY_01 += Y_01_Length[i] ;
        nY_02 += Y_02_Length[i] ;
        RetrieveSymmetry(CoreFileNames[i],CoreStart[i],&(CoreSym[i]) );
    }

    if (! (Y_01 = (int **) UTL_MEM_ALLOC(sizeof(int *) * nY_01)))
        goto UnableToAllocateMemory ;
    if (!(cY_01 = (int *) UTL_MEM_ALLOC(sizeof(int ) * nY_01)))
        goto UnableToAllocateMemory ;
    if (!(iY_01 = (int *) UTL_MEM_ALLOC(sizeof(int ) * nY_01)))
        goto UnableToAllocateMemory ;
    if (! (X_01 = (unsigned char **)
                  UTL_MEM_ALLOC(sizeof(unsigned char *)*nY_01)))
        goto UnableToAllocateMemory ;
    if (!(iX_01 = (double *) UTL_MEM_ALLOC(sizeof(double ) * nY_01)))
        goto UnableToAllocateMemory ;
    if (!(jX_01 = (double *) UTL_MEM_ALLOC(sizeof(double ) * nY_01)))
        goto UnableToAllocateMemory ;
#ifdef OBSOLETE_IS_OK
    if ( NumRangeFields )
    {
        if(!(RangeValues_Y01 = (float **) UTL_MEM_ALLOC(sizeof(float *) * nY_01)))
            goto UnableToAllocateMemory ;
    }
    if( NumOneOfFields )
    {
        if (!(OneOfValues_Y01 = (int **) UTL_MEM_ALLOC(sizeof(int *) * nY_01)))
            goto UnableToAllocateMemory ;
    }
#endif
    /*
    ** Read all the values for the X1 file.
    */
    for ( j = 0 ; j < TotalInputs ; j++ )
    {
        if ( !(fileHandles[j] = fopen(X1file[j],"r")) )
            goto UnableToOpenX1File ;
        for (i=0;i<Y_01_Length[j];i++)
        {
            if (! GetNextLine( fileHandles[j],
                               fingerFP[j],
                               cY_01+i+offset ,
                               Y_01+i+offset ,
                               X_01+i+offset ,
                               0
#ifdef OBSOLETE_IS_OK
                               ,RangeValues_Y01 + i + offset ,
                               OneOfValues_Y01 + i + offset
#endif
                               )) return 0;
        }
        offset += Y_01_Length[j] ;
        fclose(fileHandles[j]);
    }

    if (! (Y_02 = (int **) UTL_MEM_ALLOC(sizeof(int *) * nY_02)))
        goto UnableToAllocateMemory ;
    if (!(cY_02 = (int *) UTL_MEM_ALLOC(sizeof(int ) * nY_02)))
        goto UnableToAllocateMemory ;
    if (!(iY_02 = (int *) UTL_MEM_ALLOC(sizeof(int ) * nY_02)))
        goto UnableToAllocateMemory ;
    if (! (X_02 = (unsigned char **)
                  UTL_MEM_ALLOC(sizeof(unsigned char *) * nY_02)))
        goto UnableToAllocateMemory ;
    if (!(iX_02 = (double *) UTL_MEM_ALLOC(sizeof(double ) * nY_02)))
        goto UnableToAllocateMemory ;
    if (!(jX_02 = (double *) UTL_MEM_ALLOC(sizeof(double ) * nY_02)))
        goto UnableToAllocateMemory ;
#ifdef OBSOLETE_IS_OK
    if ( NumRangeFields )
    {
        if(!(RangeValues_Y02 = (float **) UTL_MEM_ALLOC(sizeof(float *) *
                                                        nY_02)))
            goto UnableToAllocateMemory ;
    }
    if ( NumOneOfFields )
    {
        if (!(OneOfValues_Y02 = (int **) UTL_MEM_ALLOC(sizeof(int *) *
                                                       nY_02)))
            goto UnableToAllocateMemory ;
    }
#endif
    offset = 0 ;
    for ( j = 0 ; j < TotalInputs ; j++ )
    {
        if ( !(fileHandles[j] = fopen(X2file[j],"r")) )
            goto UnableToAllocateMemory ;
        for (i=0;i<Y_02_Length[j];i++)
        {
            if (! GetNextLine( fileHandles[j],
                               fingerFP[j],
                               cY_02+i+offset ,
                               Y_02+i+offset,
                               X_02+i+offset ,
                               1
#ifdef OBSOLETE_IS_OK
                               ,RangeValues_Y02 + i + offset ,
                               OneOfValues_Y02 + i + offset
#endif
                               )) return 0;
        }
        offset += Y_02_Length[j] ;
        fclose(fileHandles[j]);   /* indexed by j: i has run past the end of the inner loop */
    }

    if (!Good_1) /* note: Good_1 is never used but triggers other allocations */
    {
        i = (nY_01+31)/32 * 4;
        if (!(Good_1 = (int *) UTL_MEM_ALLOC(i))) return 0; memset( Good_1,0,i);
        if (!(Dead_1 = (int *) UTL_MEM_ALLOC(i))) return 0; memset( Dead_1,0,i);
        i = (nY_02+31)/32 * 4;
        if (!(Good_2 = (int *) UTL_MEM_ALLOC(i))) return 0; memset( Good_2,0,i);
        if (!(Dead_2 = (int *) UTL_MEM_ALLOC(i))) return 0; memset( Dead_2,0,i);
        for (size=0,j=0;j< TotalInputs ; j++ )
        {   BitOffsets[j] = size;
            size += ( Y_01_Length[j] * Y_02_Length[j] ) ;
        }
        Pro_size = size = ( size + 31 )/32 * 4 ;
        if (!(Good_Products = (int *) UTL_MEM_ALLOC(size))) return 0;
        memset( Good_Products,0,size);
        if (!(Dead_Products = (int *) UTL_MEM_ALLOC(size))) return 0;
        memset( Dead_Products,0,size);
        if ( !MasterFile ) /* gather the dead together.... */
            Mortuary(bitset, TotalInputs, Dead_Products, size, BitOffsets);
    }
    offset = SomeLeft = 0 ;
    /*
    ** Figure out the number of products for each input set and the total
    ** number of products.
    */
    for ( j = 0 ; j < TotalInputs ; j++ )
    {
        BitMapStartPoint[j] = offset ;
        offset += Y_01_Length[j] * Y_02_Length[j] ;
        SomeLeft += RemainingInput[j];
    }
    TotalProducts = offset ;
#ifdef OBSOLETE_IS_OK
    /*
    ** Initialize the needed structures to pass around.
    */
    RangeValuesData.numRangeFields = NumRangeFields ;
    RangeValuesData.rangeValues_Y01 = RangeValues_Y01 ;
    RangeValuesData.rangeValues_Y02 = RangeValues_Y02 ;
    RangeValuesData.rangeFields = RangeFields ;
    OneOfValuesData.numOneOfFields = NumOneOfFields ;
    OneOfValuesData.oneOfValues_Y01 = OneOfValues_Y01 ;
    OneOfValuesData.oneOfValues_Y02 = OneOfValues_Y02 ;
    OneOfValuesData.oneOfFields = OneOfValues ;
#endif
    InputData.totalInputs = TotalInputs ;
    InputData.Y_01_Length = Y_01_Length ;
    InputData.Y_02_Length = Y_02_Length ;
#if 0
    /*
    ** Read in the -rangevar values if they are present in the csln file.
    */
    if( !ReadRangeVarFromCoreFile( Corefile,
                                   NumRangeFields,
                                   RangeFields) )
        return 0 ;
#endif
    return 1;
UnableToOpenX1File :
    fprintf(stderr, "Unable to open reagent file\n");
    goto AddTraceback ;
UnableToAllocateMemory :
    fprintf(stderr,"Unable to allocate memory\n");
    goto AddTraceback ;
UnableToReadBitset :
    fprintf(stderr,"Unable to Read bitset file\n");
    goto AddTraceback ;
UnableToReadMaster :
    fprintf(stderr, "Unable to Read master file\n");
    goto AddTraceback ;
AddTraceback :
    return 0 ;
}

int RetrieveSymmetry( char *fileName, int Start, int *pSym )
{
    FILE *tmp;
    char *line;

    if (!(tmp = fopen(fileName,"r"))) return 0;
    for ( ; Start; Start--)
        if (-1 == UTL_SCAN_GETS( tmp, "\\", &line)) return 0 ;
    fclose(tmp);
    if (strstr(line,"SYM=1")) *pSym = 1; else *pSym = 0;
    return 1;
}

/* concatenate a series of compressed bitsets into one big raw bitset
   -> AND <- destroy those compressed bitsets */
int Mortuary(void *bitset[], int nsets, int *rawbits, int byte_size, int *offset)
{
    int i ;

    for (i=0; i< nsets; i++)
    {   CS_PRDCT_BITSET_CONCAT_RAW( bitset[i], rawbits, offset[i], 0);
        CS_PRDCT_BITSET_DESTROY_BIT_STRING(bitset[i]);
        bitset[i] = NULL;
    }
    not_here( rawbits,byte_size );
}



static int ParseArguments( argc, argv )
/*+!
 * This function parses the command line arguments.
 *
 * Returns: 1 on a successful command line parse, 0 otherwise.
 *
 * Warnings:
 *
 * Errors:
 *
 * See Also:
 *
 * Author        Date      Description
 * G. B. Smith   02-09-93  Original Version
 *
 */
int argc;
char **argv;
{
    int nargs,
        noptions = sizeof( Options )/sizeof(Options[0]);

    OutputFile = stdout;
    nargs = UTL_PARSE_GPT( argc, argv, noptions, Options );
    if( !nargs ) goto SyntaxError;
    if (WhatFirst)
    {   if (strstr(WhatFirst,"R1")) WhatFirst[0]='1';
        if (strstr(WhatFirst,"R2")) WhatFirst[0]='2';
    } else {
        WhatFirst=UTL_MEM_ALLOC(2); WhatFirst[0]='0'; }
#ifdef OBSOLETE_IS_OK
    if ( RangeVar && !
         ParseRangeVar(RangeVar,&NumRangeFieldsAllocated,&NumRangeFields,&RangeFields))
        goto SyntaxError ;
    if ( OneOfVar &&
         !ParseOneOfVar(OneOfVar,&NumOneOfFieldsAllocated,&NumOneOfFields,&OneOfValues))
        goto SyntaxError ;
#endif
    return 1;
SyntaxError:
    return 0;
}

static int OpenOutputFile()
/*
 *
 * Returns: 1 on success, else 0
 */
{
    char *msg;
    FILE *fp;

    OutputFile = stdout;
    if( OutputFileName )
    {
        /*
        ** We need to create output files under the ownership of the REAL user not the
        ** EFFECTIVE user. This only applies if setuid options are activated.
        */
        {
            struct stat statBuff ;
            int uid ;
            int euid ;

            uid = getuid() ;
            euid = geteuid();
            stat(OutputFileName, &statBuff);
            /*
            ** There are two cases
            ** (1) The file to output to exists
            **     Use the ownership of the current owner of the file or if you cant do that
            **     do not do anything.
            ** (2) The file is being created.
            **     use the ownership of the REAL user.
            */
            if ( access(OutputFileName, F_OK) == 0 )
            {   /* If the file exist and the real user is the owner of the file */
                if ( statBuff.st_uid == uid )
                    seteuid(uid);
            }
            else
            {   /* Create the file as the REAL user */
                seteuid(uid);
            }
        }
        OutputFile = fopen( OutputFileName, "wb");
        if( !OutputFile ) {
            fprintf(stderr,"Error: Failed to open output file \"%s\"\n",
                    OutputFileName );
            goto ErrorReturn;
        }
    }
    return 1;
ErrorReturn:
    return 0;
}
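
The ownership rule described in the comment of OpenOutputFile can be sketched in isolation. The following is a minimal illustration under stated assumptions, not the listing's own code: the helper name open_as_real_user is hypothetical, and unlike the listing it checks the return values of stat() and seteuid() instead of calling access().

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper (not part of the original listing): open `path`
 * for binary writing after dropping the effective uid to the real uid
 * when the file is new or is already owned by the real user.
 * Returns NULL on failure. */
FILE *open_as_real_user(const char *path)
{
    struct stat sb;
    uid_t real = getuid();

    if (stat(path, &sb) == 0) {
        /* Case 1: the file exists. Switch only if the real user owns it. */
        if (sb.st_uid == real && seteuid(real) != 0)
            return NULL;
    } else {
        /* Case 2: the file is being created. Create it as the real user. */
        if (seteuid(real) != 0)
            return NULL;
    }
    return fopen(path, "wb");
}
```

For a process that is not running setuid the real and effective uids already agree, so the seteuid() calls leave the process unchanged and the helper reduces to a plain fopen().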



static CloseOutputFile()
/*+!
 *
 * This function closes the output file. It is included just for cleanliness.
 *
 * Author          Date       Description
 * G. B. Smith     02-09-93   Original Version
 *
 */
{
    fclose( OutputFile );
}

CheckPointProgram(programName)
char *programName ;
{
    int sizes[2] ;
    int allocSizes[2] ;
    int numInSites[2] ;
    char hold[81] ;
    int i ;
    void *compressed ;
    int total ;

    for ( i = 0 ; i < TotalInputs ; i++ )
    {
        sizes[0] = Y_01_Length[i] ;
        sizes[1] = Y_02_Length[i] ;
        numInSites[0] = numInSites[1] = -1 ;
        allocSizes[0] = allocSizes[1] = -1 ;
        /*
        ** Lets get a compressed version of the dead products before we write it out
        ** to file.
        */
        compressed = CS_PRDCT_BITSET_CREATE_BIT_STRING(
                         Dead_Products,
                         BitMapStartPoint[i],
                         2,
                         sizes,
                         sizes,
                         &total);

WriteOutaieckIV>intFile(OutputC%eck^^ 
                         MasterFile ? InputNames[i]
                                    : MasterFile_Bitset[i],
                         MasterFile ? InputStartRec[i]
                                    : StartRec_Bitset[i],
                         programName,
                         Good_Products,
                         BitMapStartPoint[i],
                         2,
                         sizes,
                         allocSizes,
                         Selections[i],
                         numInSites,
                         total,
                         compressed);
        CS_PRDCT_BITSET_DESTROY_BIT_STRING(compressed);
    }
}



int main( argc, argv )
/*+E
 */
int argc;
char **argv;
{
    long startTime,
         totalTime,
         finishTime;
    int numFiltered ;
    int numEliminated ;
    int tmp ;
    char comline[2048];

    /***
    *** Establish handler for a user interrupt.
    ***/
    signal( SIGINT, UserHitControlC);
#ifdef SIGHUP
    signal( SIGHUP, UserHitControlC);
#endif
    /*
    ** Initialize variables.
    */
    Distance = 0.240;
    if( !ParseArguments( argc, argv ) )
        goto SyntaxError;
    if( !OpenOutputFile() ) goto FailureExit;
    /* if (!RestartState()) goto FailureExit; */
    time( &startTime );
    Visual((stderr, "Begin reading files: %s",ctime(&startTime)));
    /* Let's actually do something now */
    if (!ReadEverything(NoMorehitsPlease)) goto FailureExit;
    time( &finishTime );
    Visual((stderr, "Begin filtering: %s",ctime(&finishTime)));
    if
(!FilterProducts(&InputData,&RangeValuesData.&OneOfValuesData,&numF
ingA Value))
        goto FailureExit;
    CurrentInput = 0 ;
    time( &finishTime );
    Visual((stderr,"Filtered out %d out of %d possible products\n",numFiltered,
            TotalProducts ));

#if 0
    /*
    ** Now see if there are any hitlists or databases that you should filter
    ** for.
    */
    time( &finishTime );
    Visual((stderr,"Begin eliminating selections in Unity database %s",
            ctime(&finishTime)));
    if ( !EliminateProductsFromDatabase(DatabaseNames,
                                        &numEliminated,
                                        ZapAllNeighbors))
        goto FailureExit;
    time( &finishTime );
    time( &finishTime );
    Visual((stderr,"Begin eliminating selections in Unity hitlist %s",
            ctime(&finishTime)));
    if ( !EliminateProductsFromHitlist(HitlistNames,
                ScreenFileName?ScreenFileName: DefaultScreenFileName,
                &tmp,
                ZapAllNeighbors))
        goto FailureExit;
    time( &finishTime );
    Visual((stderr,"Eliminated %d out of %d possible products\n",numEliminated+tmp,
            TotalProducts ));
#endif
    Visual((stderr,"Begin selection: %s",ctime(&finishTime)));
    if (!UserAborted &&
        !SelectEverything(InputSource,
                          NoMorehitsPlease,
                          WhatFirst,
                          CalcualteProductFingurePrint,
                          ActuallyCompute,
                          ZapAllNeighbors)) goto FailureExit;
    CloseOutputFile();
    time( &finishTime );
    totalTime = finishTime - startTime;
    if( !totalTime ) totalTime = 1;
    Visual((stderr, "Created %d Selections in ", nProcessed ));
    Visual((stderr,"%d Hours, %d min, %d secs\n",
            totalTime/(60*60),
            (totalTime%(60*60))/60,
            (totalTime%60)));
    Visual((stderr,"Each comparison required %.8f seconds to calculate\n",
            (totalTime/((double)(nProcessed?nProcessed: 1)))));
    Visual((stderr,"End Quick Select Computation: %s",ctime(&finishTime)));
    MakeComLine(comline, 2048, argc, argv);
    CheckPointProgram(comline) ;
    UserAborted ? exit(ErrorExit) : exit(GoodExit);
SyntaxError:
    exit(1);
FailureExit:
    exit(ErrorExit);
}
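
The timing report at the end of main splits a duration in seconds into hours, minutes and seconds, and divides it by the number of selections while guarding against a zero count. A minimal sketch of that arithmetic follows; the helper names split_elapsed and seconds_per_item are hypothetical and do not appear in the listing.

```c
/* Hypothetical helper: split a duration in seconds into h/m/s parts,
 * using the same integer arithmetic as the report in main. */
void split_elapsed(long total, long *hours, long *mins, long *secs)
{
    *hours = total / (60 * 60);
    *mins  = (total % (60 * 60)) / 60;
    *secs  = total % 60;
}

/* Hypothetical helper: average seconds per processed item. A zero
 * count is treated as one so the division is always defined. */
double seconds_per_item(long total, int count)
{
    return total / (double)(count ? count : 1);
}
```

For example, 3725 seconds splits into 1 hour, 2 minutes and 5 seconds.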
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A ppendix "M" 

/* listbitset */

#include <stdio.h>
#include <signal.h>
#include <ctype.h>
#include <unistd.h>
#include <string.h>
#include <sys/stat.h>
#include <math.h>
#include "parseopt.h"
#include "utl_str.h"
#include "utl_mem.h"
#include "utl_file.h"
#include "utl_math.h"
#include "ct.h"
#include "ct_expr.h"
#include "ct_proto.h"
#include "import_proto.h"
#include "io_fprint.h"
#include "hits.h"
#include "hits_proto.h"
#include "commonData.h" /* Globals use by most functions, we will clean this
                           up soon */
#include "dbcsln_bs_proto.h"
#include "dbcsln_hlm_proto.h"
extern char *basename() ;
extern char *DB_CT_CCT_FIX_SLN();

typedef enum
{
    HeaderOnly,
    FullListing,
    DetailListing,
    Hitlist
} ListingOptions ;

static char *OptionsNames[] = { "header", "full", "detail" , "str_list" } ;
static ListingOptions listOption = HeaderOnly ;
static char *HitlistFile ;
static char *BitsetFile ;
static int UserAborted;
static char *CombNameTemplate= (char *)NULL ;
static int CombCounter;
char *HitlistName = "listbitset-hits" ;
FILE *HitFile;
static char *Prefix ;
static struct ParseOptions Options[] =
{
    /***
    *** DO NOT MOVE ENTRIES IN THIS TABLE. ADD ENTRIES ONLY AT THE
    *** END.
    ***/
    {"hitlist", ParseOptString, &HitlistFile,
     "Name is the file with hitlist records, ie. xxxxx.hits file" },
    {"bitset", ParseOptString, &BitsetFile,
     "Name is the bitset file . ie. xxxxx.csr file" },
    {"list", ParseOptEnum, (void *)OptionsNames ,
     "Type of output" },
    {"output" , ParseOptString, &HitlistName,
     "Name is the file with hitlist records, ie. xxxxx.hits file" },
    {"prefix", ParseOptString, &Prefix,
     "Prefix for naming the products.Product name will be Prefix_Y01_Y02_n"},
};

static void UserHitControlC()
/*
 * This function is the signal handler for user initiated program
 * termination.
 *
 * It's only role is to set a flag indicating that the user wishes to abort
 * the program.
 *
 * Author          Date       Description
 * G. B. Smith     02-09-93   Original Version
 *
 */
{
    UserAborted = 1;
}
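
The handler above only records that an interrupt arrived; the main loop is expected to poll the flag. A minimal standard-C sketch of the same pattern is shown below. The names user_aborted, on_interrupt, install_interrupt_handler and was_aborted are hypothetical, and the flag is declared volatile sig_atomic_t, which the listing's plain int does not do.

```c
#include <signal.h>

/* Flag written by the handler and polled by the main program. */
static volatile sig_atomic_t user_aborted = 0;

/* Handler: record the request and return immediately. */
static void on_interrupt(int signo)
{
    (void)signo;
    user_aborted = 1;
}

/* Install the handler for SIGINT and, where available, SIGHUP.
 * Returns 1 on success, 0 on failure. */
int install_interrupt_handler(void)
{
    if (signal(SIGINT, on_interrupt) == SIG_ERR)
        return 0;
#ifdef SIGHUP
    if (signal(SIGHUP, on_interrupt) == SIG_ERR)
        return 0;
#endif
    return 1;
}

/* Returns nonzero once an interrupt has been seen. */
int was_aborted(void)
{
    return user_aborted != 0;
}
```

Keeping the handler down to a single assignment avoids calling functions that are not safe inside a signal handler.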

static int ParseArguments( argc, argv )
/*
 *
 * This function parses the command line arguments.
 *
 * Returns: 1 on a successful command line parse, 0 otherwise.
 *
 * Warnings:
 *
 * Errors:
 *
 * See Also:
 *
 * Author          Date       Description
 * G. B. Smith     02-09-93   Original Version
 *
 */
int argc;
char **argv;
{
    char *fileTypeName ;
    int nargs,
        noptions = sizeof( Options )/sizeof(Options[0]);

    nargs = UTL_PARSE_OPT( argc, argv, noptions, Options );
    if( !nargs ) goto SyntaxError;
    fileTypeName = *((char**)Options[2].value);
    if( !strcmp( "detail", fileTypeName ))
        listOption = DetailListing;
    else if( !strcmp( "full", fileTypeName ))
        listOption = FullListing;
    else if( !strcmp( "str_list", fileTypeName ))
        listOption = Hitlist;
    else
        listOption = HeaderOnly;
    return 1;
SyntaxError:
    return 0;
}

int CallbackFunc(ct,numAttachments,indexes)
struct CtConnectionTable *ct ;
int numAttachments ;
int indexes[] ;
{
    static char *sln = 0;
    static char nameBuffer[41];

    CombCounter += 1;
    if ( CombNameTemplate && *CombNameTemplate )
    {
        if ( !DB_CT_CT_ATTR_EXISTS( ct, CtCtName ) &&
             !DB_CT_CT_ATTR_EXISTS( ct, CtQRegId ))
        {
            (void)sprintf( nameBuffer, "%.30s_%d_%d_%0d",
                           CombNameTemplate,
                           indexes[0],
                           indexes[1],
                           CombCounter);
            if (!DB_CT_SET_CT_ATTR( ct, CtCtName, nameBuffer ))
                goto trc;
        }
    }
    if ( sln = DB_CT_SLN_GENERATE( ct ))
    {
        fprintf( (FILE*)HitFile, "%s\n", sln );
        UTL_MEM_FREE( sln );
    }
    else {
        if ( UTL_ERROR_IS_SET())
            goto trc;
    }
    return 1;

trc: UTL_ERROR_ADD_TRACE( "CallbackFunc" );
    return 0;
}



static struct HitsHitList *CreateHitlist( hName, cSln, core )
char *hName;
char *cSln;
char *core;
{
    struct HitsHitList *hitlist;
    char *bname;
    int len;

    hitlist = DB_HITS_CREATE( "STRLIST" );
    if ( !hitlist ) goto trc;
    if ( !hName || !*hName ) hName = "SCRATCH";
    bname = (char *) basename( hName, (char*)0 );
    DB_HITS_SET_ATTR( hitlist, HitsAttrName, bname );
    if (Prefix)
    {
        CombNameTemplate = UTL_STR_SAVE(Prefix);
    }
    else
    {
        if ( ! CombNameTemplate && bname )
        {
            CombNameTemplate = bname;
            for ( len = 0, bname = CombNameTemplate+1; *bname; ++bname,
                  ++len )
            {
                if ( !isalnum( *bname ) || len == 36 )
                {
                    *bname = 0;
                    break;
                }
            }
        } else {
            if (bname) UTL_MEM_FREE(bname);
        }
    }
    DB_HITS_SET_ATTR ( hitlist, HitsAttrFilename, hName );
    DB_HITS_SET_ATTR ( hitlist, HitsAttrDatabase, "NONE" );
    DB_HITS_SET_ATTR ( hitlist, HitsAttrSource, "SLN_EXPLODER" );
    if ( cSln && *cSln )
        DB_HITS_SET_ATTR ( hitlist, HitsAttrQuery, cSln );
    if ( core && *core )
        DB_HITS_SET_ATTR ( hitlist, HitsAttrCore, core );
    DB_HITS_SYNC_FILE( hitlist, hName, (void*)0 );
    return hitlist;

trc: UTL_ERROR_ADD_TRACE( "CreateHitlist" );
    return 0;
}



DumpBitsetInfo(char *bitsetName,int bIndex,char *core )
{
    void *bitset ;
    int numProducts ;
    int *sizes= (int *)NULL ;
    int *numUsed= (int *)NULL ;
    int i ;
    char *masterName ;
    int masterRec ;
    char *coreInfo ;
    char *xrString ;
    int numSites ;
    char **xNames ;
    char *bName ;
    char *newCore ;
    char *cp ;
    char buffer[1024] ;
    int **productIndexes ;
    void *c ;

    struct HitsHitList *CreateHitlist(), *h ;
    char *fixedCore ;

    if ( !(bitset = CS_PRDCT_BITSET_OPEN(bitsetName,bIndex)))
    {
        fprintf(stderr, "Unable to open %s %d\n",bitsetName,bIndex);
        return 0 ;
    }
    if (!CS_PRDCT_BITSET_GET_STATS(bitset,
                                   &numSites,
                                   &numProducts,
                                   &sizes,
                                   &numUsed))
    {
        fprintf(stderr, "Unable to get stat on %s %d\n",bitsetName,bIndex);
        return 0 ;
    }
    if ( numProducts == -1 )
        numProducts = 0 ;
    if ( ! core )
    {
        CS_PRDCT_BITSET_CORE_INFO(bitset,
                                  &masterName,
                                  &masterRec,
                                  &coreInfo ,
                                  &xrString,
                                  &numSites,
                                  &xNames);
        newCore = Replace_Y_0x_With_Xx(coreInfo);
    }
    else
    {
        newCore = Replace_Y_0x_With_Xx(core);
    }
    if ( listOption == Hitlist )
    {
        fixedCore = DB_CT_CCT_FIX_SLN(newCore,1);
        if (!(h = CreateHitlist( HitlistName, "query_sln", "core_sln")))
            return 0;
        if (!(HitFile = fopen(HitlistName,"a")))
            return 0;
        /*
        ** Allocate the arrays.
        */
        productIndexes = (int **)UTL_MEM_CALLOC(numSites,sizeof(int *));
        for ( i = 0 ; i < numSites ; i++ )
        {
            productIndexes[i] = (int *)UTL_MEM_CALLOC(numProducts,sizeof(int ));
        }
        /*
        ** Figure out what indexes compose a product.
        */
        CS_PRDCT_BITSET_GET_HITS(bitset,productIndexes);
        /*
        productIndexes[0][nProcessed] = Y_01 ;
        productIndexes[1][nProcessed] = Y_02 ;
        */
        c = (void *) DB_CT_CCT_GET_PRD_INIT(fixedCore,
                                            xrString,
                                            numSites,
                                            xNames);
        if(!c)
        {
            fprintf(stderr,"\nUnable to init");
            return -1 ;
        }
        DB_CT_CCT_GET_PRD_PRODUCT(c,
                                  numProducts,
                                  productIndexes,
                                  CallbackFunc) ;
        DB_CT_CCT_GET_PRD_CLEANUP(c);
        DB_HITS_CLOSE( h );
        if ( CombNameTemplate )
            UTL_MEM_FREE( CombNameTemplate ),
            CombNameTemplate = 0;
        fclose(HitFile);
    }
    else
    {
        fprintf(stdout,"%s %d\n",bitsetName,bIndex);
        cp = strtok(newCore,"<");
        bName = basename(bitsetName,NULL);
        if(cp)
            sprintf(buffer,
" %s < CS^PRD_BITSET^FILE=\- %s\^CS_PM>_BITSET^OFFSET=\
                    cp,bName,bIndex);
        else
            sprintf(buffer,
"%s<CS_PRD__B^^ET_FILE==\••%s\^CS^PRD_BITSET_OFFSET=\••%d\•• >
                    newCore,bName,bIndex) ;
        fprintf(stdout," %s\n" ,buffer);

        fprintf(stdout,"Num Products : %d of %d\n",numProducts,sizes[0]*sizes[1]);
        for ( i = 0 ; i < numSites ; i++ )
            fprintf(stdout,"Num Y_0%d : %d of %d\n",i+1,numUsed[i],sizes[i]);
        if ( ( listOption == FullListing ) || ( listOption == DetailListing ) )
            CS_PRDCT_BITSET_DUMP(bitset);
        return 1 ;
    }
}



CheckPointProgram()
{
    fprintf(stderr,"CheckPointProgram() is a lonely stub in listbitset.c\n");
}

int main( argc, argv )
int argc;
char **argv;
{
    int totalHits ;
    int i ;
    int j ;
    int hId ;
    int bIndex ;
    char bitsetName[1024] ;
    char hold[81];
    char *cp ;
    char *pRet ;
    char *dir ;
    char *fullPath ;
    int n ;
    char *core ;

    /***
    *** Establish handler for a user interrupt
    ***/
    signal( SIGINT, UserHitControlC);
#ifdef SIGHUP
    signal( SIGHUP, UserHitControlC);
#endif
    if( !ParseArguments( argc, argv ) )
        goto SyntaxError;
    if ( !HitlistFile && !BitsetFile )
    {
        fprintf(stderr,"An input (bitset or hitlist) file is required\n");
        goto SyntaxError ;
    }
    if ( HitlistFile )
    {
        if ( !(hId = CS_HLM_OPEN_HITLIST(HitlistFile)) )
            goto UnableToOpenHitlist ;
        dir = dirname(HitlistFile);
        totalHits = CS_HLM_GET_HITS_TOTAL(hId) ;
        for ( i = 0 ; i < totalHits ; i++ )
        {
            pRet = CS_HLM_GET_HITS(hId, i, 1);
            /*
            ** Grab the bitset file name and the offset from the csln.
            */
            cp = strstr(pRet,"CS_PRD_BITSET_FILE=");
            if(!cp)
                goto InvalidCsln ;
            cp += 20 ;
            j = 0 ;
            while ( *cp != '"' )
                bitsetName[j++] = *cp++ ;
            bitsetName[j] = 0 ;
            cp = strstr(pRet,"CS_PRD_BITSET_OFFSET=");
            if(!cp)
                goto InvalidCsln ;
            cp += 21 ;
            j = 0 ;
            while ( *cp != '"' )
                hold[j++] = *cp++ ;
            hold[j] = 0 ;
            bIndex = atoi(hold);
            if ( dir )
                fullPath =
                    UTL_FILE_ADD_DIR_TO_DIRSPEC(dir,bitsetName);
            else
                fullPath = bitsetName ;
            core = strtok(pRet,"\n");
            core = strtok(NULL,"\n");
            DumpBitsetInfo(fullPath,bIndex,core);
        }
    }
    else
    {
        n = CountBitSets(BitsetFile);
        for( i = 0 ; i < n ; i++ )
            DumpBitsetInfo(BitsetFile,n,NULL);
    }
    UserAborted ? exit(ErrorExit) : exit(GoodExit);
SyntaxError:
    exit(1);
FailureExit:
UnableToOpenHitlist :
InvalidCsln :
    exit(ErrorExit);
}
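
The hitlist branch of main pulls two quoted attribute values (the bitset file name and the bitset offset) out of each csln record using fixed skip counts. The same extraction can be sketched with explicit bounds checks. This is an illustration only: get_quoted_value is a hypothetical helper, and it assumes, as the record format appears to, that each value follows its key as KEY="value".

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: find `key` (which includes the trailing '=') in
 * `text`, expect an opening double quote right after it, and copy the
 * characters up to the closing quote into `out`.
 * Returns 1 on success; 0 if the key or either quote is missing, or if
 * the value does not fit in `outsz` bytes including the terminator. */
int get_quoted_value(const char *text, const char *key,
                     char *out, size_t outsz)
{
    const char *p = strstr(text, key);
    const char *end;
    size_t len;

    if (!p)
        return 0;
    p += strlen(key);
    if (*p != '"')
        return 0;
    p++;                      /* step over the opening quote */
    end = strchr(p, '"');
    if (!end)
        return 0;
    len = (size_t)(end - p);
    if (len + 1 > outsz)
        return 0;
    memcpy(out, p, len);
    out[len] = '\0';
    return 1;
}
```

Deriving the skip distance from strlen(key) avoids hand-counted offsets, and the length check keeps a malformed record from overrunning the destination buffer.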
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CODATA 

#include <stdio.h>
#include <signal.h>
#include <ctype.h>
#include <unistd.h>
#include <string.h>
#include <sys/stat.h>
#include <math.h>
#include "parseopt.h"
#include "utl_str.h"
#include "utl_mem.h"
#include "utl_file.h"
#include "utl_math.h"
#include "ct.h"
#include "ct_expr.h"
#include "ct_proto.h"
#include "import_proto.h"
#include "io_fprint.h"
#include "dservTypes.h"
#if 0
int NumRangeFields ;
int NumRangeFieldsAllocated ;
RangeStruct *RangeFields ;
int NumOneOfFieldsAllocated ;
int NumOneOfFields ;
OneOfStruct *OneOfValues ;
float **RangeValues_Y01 ; /* Actual values read in from nnn.X1 file, if MW is the first
   and logp is the second value specified on the -rangevar argument list then
   RangeValues_Y01[n][0] would keep the value for MW for the nth line in the nnn.X1 file
   and RangeValues_Y01[n][1] would keep the value for logp for that line*/
float **RangeValues_Y02 ; /* same */
int **OneOfValues_Y01 ; /* Actual values read from nnn.X1 files but translated into an
   index of OneOfValues[i].values so we dont have to waste memory and time doing
   strcmp */
int **OneOfValues_Y02 ; /* Same */
#endif

int TotalInputs = 0 ;
int CurrentInput = 0 ;
FILE *fileHandles[255] ;
char *InputNames[255] ;
int Y_01_Length[255] ;
int Y_02_Length[255] ;
int BitMapStartPoint[255] ;
int RemainingInput[255] ;

FILE *OutputFile;

FILE *InputSourceFile;
/* Code presumes that an int is 32 bits, ASCII-ed into %.8x format */
int **Y_01;            /* fingerprints */
int **Y_02;            /* " */
int *query;            /* " */
int nY_01;             /* number of structures */
int nY_02;             /* " */
int *cY_01;            /* cardinality of fingerprints */
int *cY_02;            /* " */
int c_query;           /* " */
int *iY_01;            /* intersection count of fprints */
int *iY_02;            /* " */
unsigned char **X_01;  /* topomers */
unsigned char **X_02;  /* " */
double *iX_01;         /* distance of topomers to selection */
double *iX_02;         /* " */

int *Good_1;
int *Good_2;
int *Dead_1;
int *Dead_2;
int *Good_Products;
int *Dead_Products;

int nbits[256];
int BigBits[65536];
int setbits[8];
double boundary[16];
double Dist[16][16];

double Distance = 80.0 ;
int BytesPerFingerPrint[2] ;

int nProcessed = 0;
int SomeLeft;
char next8[10] = "01234567\0";
char next2[10] = "01\0";
char *ScreenFileName;
char DefaultScreenFileName[32] = "$TA_MOLTABLES/standard.2DRULES" ;

long waste_time, trash_time, inTestBit, inActually;
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A ppendix "O" 
DBjnL 

#include <stdio.h>
#include <signal.h>
#include <ctype.h>
#include <unistd.h>
#include <string.h>
#include <sys/stat.h>
#include <math.h>
#include "parseopt.h"
#include "utl_str.h"
#include "utl_mem.h"
#include "utl_file.h"
#include "utl_math.h"
#include "ct.h"
#include "ct_expr.h"
#include "ct_proto.h"
#include "import_proto.h"
#include "io_fprint.h"
#include "commonData.h" /* Globals use by most functions, we will clean this
                           up soon */

static int What1 = -1;
static int What2 = 0 ;
static int (*CalcFingerPrintFunc)();
static int (*ActuallyComputeFunc)();
extern FILE *debugFile ;



int CountLines(fp)
FILE *fp ;
{
    int i;
    char *foo;

    i=0;
whUe ( -1 != UTL_SCAN_GETS( fp, "W, "r, &foo)) i++;
    rewind(fp);
    return i;
}

10 int SdectEverything(inputSource,maxHits,whatFirst,calcFP,compu 

char *inputSource ;
int maxHits ;
char *whatFirst ;
int (*calcFP)();
int (*computeFP)();
int (*zapNeighbors)();
{
    int c_qt, c_lo, c_hi, i, j, carhold, inthold, onion, intsc;
    double max;
    int k;
    int Y_01_Offset, Y_02_Offset ;
    int pos ;
    int numZapped = 0 ;

    CalcFingerPrintFunc = calcFP ;
    ActuallyComputeFunc = computeFP ;
    while (nProcessed < maxHits && ( SomeLeft > 0 ) )
    {
        /*
        ** What we would like to do is first select any selections that were found
        ** in a previous run.
        */
        if ( !inputSource || !( c_query = SelectFromInputFile(inputSource,query)) )
        {
            if (! (c_query = SelectIt(query,whatFirst) )) return 0;
            nProcessed++;
            SomeLeft--;
            RemainingInput[CurrentInput]-- ;
        }
        /* then zap its neighbors and continue! */
        (*zapNeighbors)(NULL,0,&numZapped,1,What1,What2);
    } /* while still stuff left */
    return 1;
}



int TestBit(bitset, bit)
int *bitset, bit;
{
    int what, this;
    unsigned char *bytes;

    bytes = (unsigned char *) bitset;
    what = bit % 8;
    this = bit / 8;
    return (bytes[this] & setbits[what] );
}

int ZapInputProduct(inputIndex, whichProduct, index1, index2)
int inputIndex ;   /* which input are we currently processing */
int whichProduct;  /* This will be a direct index into the bitmap vector */
int index1 ;       /* if whichProduct is not given we will calcualte it */
int index2 ;       /* from these two values */
{
    if ( !whichProduct )
        whichProduct = index1 * Y_02_Length[inputIndex] + index2 ;
    whichProduct += BitMapStartPoint[inputIndex] ;
    FlagProduct(Dead_Products,index1, index2, whichProduct );
    SomeLeft-- ;
    RemainingInput[inputIndex]-- ;
}
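
ZapInputProduct turns a pair of reagent indexes into a single bit position: the pair is flattened in row-major order and then shifted by the starting bit of the current input. A minimal sketch of that mapping and its inverse is given below; product_bit and product_indices are hypothetical names, `start` stands in for BitMapStartPoint[input] and `n2` for Y_02_Length[input].

```c
/* Hypothetical helper: bit position of product (index1, index2) inside
 * the shared bitmap, for an input whose products begin at bit `start`
 * and whose second reagent list has `n2` entries. */
int product_bit(int start, int n2, int index1, int index2)
{
    return start + index1 * n2 + index2;
}

/* Hypothetical helper: inverse mapping, from a bit position back to
 * the (index1, index2) pair of the same input. */
void product_indices(int start, int n2, int bit, int *index1, int *index2)
{
    int local = bit - start;   /* position relative to this input */

    *index1 = local / n2;
    *index2 = local % n2;
}
```

With start = 10 and n2 = 4, the product (2, 3) lands on bit 21, and bit 21 maps back to (2, 3).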



int FlagProduct(TheProducts, index1, index2, this)
int *TheProducts;
int index1, index2, this;
{
    int what;
    unsigned char *Products;

    /* if (DebugLevel)
    printf("%d %d, %d, %x\n",index1,index2,this,TheProducts);*/
    Products = (unsigned char *) TheProducts;
    if (!this ) this = index1*Y_02_Length[CurrentInput] + index2; /* bit index */
    what = this % 8;
    this = this / 8;
    Products[this] |= setbits[what];
    return 1;
}

int FlagReagent(TheReagent, size, index)
int *TheReagent;
int size, index;
{
    int what, this;
    unsigned char *Reagent;

    Reagent = (unsigned char *) TheReagent;
    what = index % 8;
    this = index / 8;
    Reagent[this] |= setbits[what];
    return 1;
}
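
TestBit, FlagProduct and FlagReagent all address a bit by splitting its index into a byte number (index / 8) and a mask chosen from setbits[] by index % 8. The sketch below shows the pattern on its own. The mask table bit_mask is an assumption: the listing does not show how setbits[] is initialised, and least-significant-bit-first order is used here because it matches the right-shift scan in GrabRandom.

```c
/* Assumed mask table: entry k selects bit k of a byte, LSB first.
 * The original setbits[] initialisation is not shown in the listing. */
static const unsigned char bit_mask[8] = { 1, 2, 4, 8, 16, 32, 64, 128 };

/* Set bit number `bit` in the byte array. */
void set_bit(unsigned char *bytes, int bit)
{
    bytes[bit / 8] |= bit_mask[bit % 8];
}

/* Return 1 if bit number `bit` is set, 0 otherwise. */
int test_bit(const unsigned char *bytes, int bit)
{
    return (bytes[bit / 8] & bit_mask[bit % 8]) != 0;
}
```

Under this assumption, setting bit 10 changes byte 1 to the value 4 and leaves every other bit clear.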



int SelectFromInputFile(inputSource,query)
char *inputSource ;
int *query ;
{
    static int firstTime = 1 ;
    static FILE *fp = (FILE *)NULL ;

    unsigned char *p, *q ;
    int index1;
    int index2;
    int index ;
    char *line ;
    char *cp ;
    unsigned char *queryPtr ;
    int which ;
    int i ;
    char *name = (char *)NULL ;
    char hold[81];
    int OldIndex ;
    int Y_01_Offset , Y_02_Offset ;

    OldIndex = CurrentInput ;
    if ( firstTime )
    {
        if ( !(fp = fopen(inputSource,"r")))
            goto UnableToOpenFile ;
        firstTime = 0 ;
    }
    if (-1 == UTL_SCAN_GETS( fp, &line)) return 0;
    if ( !( cp = strtok(line," ")) )
        goto UnableToParseLine ;
    /*
    ** Hold on to this for now.
    */
    name = UTL_STR_SAVE(cp);
    if ( !( cp = strtok(NULL," ")) )
        goto UnableToParseLine ;

    index1 = atoi(cp) - 1 ;
    if ( !( cp = strtok(NULL," ")) )
        goto UnableToParseLine ;
    index2 = atoi(cp) - 1 ;

    if (( index1 < 0 ) || ( index2 < 0 ) )
        goto UnableToParseLine ;
    /*
    ** Must get the input from a file that we are processing now to work.
    */
    for ( CurrentInput = -1 , i = 0 ; i < TotalInputs ; i++ )
    {
#if 0
        which = index1 * Y_02_Length[i] + index2 ;
        sprintf(hold,"%s%d", InputNames[i], which+1) ;
        if ( UTL_STR_CMP_NOCASE(name,hold) == 0 )
        {
            CurrentInput = i ;
            break;
        }
#endif
        if ( UTL_STR_NCMP_NOCASE(name,InputNames[i],strlen(InputNames[i]))
             == 0 )
        {
            CurrentInput = i ;
            break;
        }
    }
    if ( i >= TotalInputs )
        goto InvalidInput ;
    /*
    ** If we are reading back in a selection that might have already been filtered
    ** out we better adjust our counts.
    */
    if (!TestBit(Dead_Products,
                 BitMapStartPoint[CurrentInput] +
                 index1*Y_02_Length[CurrentInput] + index2 ))
    {
        nProcessed++;
        SomeLeft--;
        RemainingInput[CurrentInput]-- ;
    }
    Y_01_Offset = Y_02_Offset = 0 ;
    for ( i = 0 ; i < CurrentInput ; i++ )
    {
        Y_01_Offset += Y_01_Length[i] ;
        Y_02_Offset += Y_02_Length[i] ;
    }

    c_query = (*CalcFingerPrintFunc)(query,
                                     Y_01[index1 + Y_01_Offset ],
                                     Y_02[index2 + Y_02_Offset ]);
#if 0
    p = (unsigned char *) Y_01[index1];
    q = (unsigned char *) Y_02[index2];
    c_query = 0;
    queryPtr = (unsigned char *)query ;

    for (index = 0;index < BytesPerFingerPrint;index++,queryPtr++)
    {
        *queryPtr = *p++ | *q++ ;
        c_query += nbits[*queryPtr & 255];
    }
#endif

    OutputThisHit(index1,index2); /* both print it and note it in bitsets */
    CurrentInput = OldIndex ;     /* reset this back to the beginning */
    if ( name )
        UTL_MEM_FREE(name);
    return c_query;

UnableToParseLine :
    fprintf(stderr,"Unable to Parse %s\n",line);
    return 0 ;
UnableToOpenFile :
    fprintf(stderr,"Unable to open file %s\n",inputSource);
    return 0 ;
InvalidInput :
    fprintf(stderr, "Input %s does not match one of the -prefix files\n", name);
    return 0 ;
}



CountFingerPrintBits(fingerPrint,length)
int *fingerPrint ;
int length ;
{
    int i ;
    int count = 0 ;

    for ( i = 0 ; i < length ; i++ )
    {
        count += nbits[fingerPrint[i] & 255];
    }
    return count ;
}
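
CountFingerPrintBits relies on the global table nbits[256], which holds the number of set bits for every byte value; the code that fills the table is not part of this listing. One common way to build and use such a table is sketched below. The names nbits_table, init_nbits and count_bits are hypothetical, and the sketch counts over an array of bytes.

```c
/* Lookup table: nbits_table[v] is the number of 1 bits in byte v. */
static int nbits_table[256];

/* Fill the table with the recurrence
 * bits(v) = (lowest bit of v) + bits(v >> 1). */
void init_nbits(void)
{
    int v;

    nbits_table[0] = 0;
    for (v = 1; v < 256; v++)
        nbits_table[v] = (v & 1) + nbits_table[v >> 1];
}

/* Count the set bits in `length` bytes using the table. */
int count_bits(const unsigned char *bytes, int length)
{
    int i, count = 0;

    for (i = 0; i < length; i++)
        count += nbits_table[bytes[i]];
    return count;
}
```

init_nbits() must run once before count_bits() is called; after that each byte costs a single table lookup.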

/* Here the intent is to select the next compound "intelligently".
   We try to maximize use of one or the other reagent.
*/
int SelectIt(query,whatFirst)
int *query;
char *whatFirst ;
{
    int i, j;

    if (What1 < 0) {GrabRandom( &i, &j, query); goto out; }
    switch (whatFirst[0])
    {
    case '0':
        GrabRandom( &i, &j, query);
        break;
    case '1':
        GrabThis( &i, &j, 1, query);
        break;
    case '2':
        GrabThis( &i, &j, 2, query);
        break;
    }
out:
    OutputThisHit(i,j); /* both print it and note it in bitsets */
    return c_query;
}

int GrabThis( p1, p2, type, fp)
int *p1, *p2, type, *fp;
{
    unsigned char *p, *q, *pro;
    int index;
    int Y_01_Offset , Y_02_Offset ;
    int i ;
    int NY02;
    int NY01 ;

    while ( CurrentInput <= TotalInputs )
    {
        /*
        ** Process each one of the inputs walking down the data array.
        */
        NY02 = Y_02_Length[CurrentInput] ;
        NY01 = Y_01_Length[CurrentInput] ;
        switch (type)
        {
        case 1:
            if (!findOne(Dead_Products, What1*NY02, 1, NY02) &&
                !findOne(Dead_Products, What2 , NY02, NY01) &&
                !GrabRandom( p1, p2, fp) )
            {
                CurrentInput++ ;
                continue ;
            }
            break;
        case 2:
            if (!findOne(Dead_Products, What2 , NY02, NY01) &&
                !findOne(Dead_Products, What1*NY02, 1, NY02) &&
                !GrabRandom( p1, p2, fp) )
            {
                CurrentInput++ ;
                continue ;
            }
            break;
        }
        /*
        ** If we are at the end of this input set, we need to advance to the
        ** next one.
        */
        if ( ( What1 >= Y_01_Length[CurrentInput] ) ||
             ( What2 >= Y_02_Length[CurrentInput] ) )
        {
            What1 = 0 ;
            What2 = 0 ;
            CurrentInput++ ;
            continue ;
        }
        break;
    }
    if ( CurrentInput == TotalInputs )
        return 0 ;
    *p1 = What1; *p2 = What2;
    Y_01_Offset = Y_02_Offset = 0 ;
    for ( i = 0 ; i < CurrentInput ; i++ )
    {
        Y_01_Offset += Y_01_Length[i] ;
        Y_02_Offset += Y_02_Length[i] ;
    }
    c_query = (*CalcFingerPrintFunc)(fp,
                                     Y_01[What1 + Y_01_Offset ],
                                     Y_02[What2 + Y_02_Offset ]);
#if 0
    pro = (unsigned char *) fp;
    p = (unsigned char *) ( Y_01[What1 + Y_01_Offset] ) ;
    q = (unsigned char *) ( Y_02[What2 + Y_02_Offset] ) ;
    c_query = 0;
    for (index = 0;index < BytesPerFingerPrint;index++,pro++)
    {   *pro = *p++ | *q++ ;
        c_query += nbits[*pro & 255]; }
#endif
    return 1;
}
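
The disabled (#if 0) blocks in the selection routines approximate a product fingerprint by OR-ing the fingerprints of its two reagents byte by byte and counting the set bits of the result. A self-contained sketch of that step follows. union_fingerprint is a hypothetical name, and the bit count uses a small shift loop instead of the nbits[] table so that the example needs no global state.

```c
/* Hypothetical helper: out[i] = a[i] | b[i] for `nbytes` bytes.
 * Returns the cardinality (number of set bits) of the result. */
int union_fingerprint(unsigned char *out,
                      const unsigned char *a,
                      const unsigned char *b,
                      int nbytes)
{
    int i, card = 0;

    for (i = 0; i < nbytes; i++) {
        unsigned char v = (unsigned char)(a[i] | b[i]);

        out[i] = v;
        while (v) {           /* count the set bits of this byte */
            card += v & 1;
            v >>= 1;
        }
    }
    return card;
}
```

The returned cardinality plays the role of c_query in the listing.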



/* This can be done more efficiently when we KNOW we are walking a vector */
int findOne(bitset,start,incr,max)
int *bitset, start, incr, max;
{
    static int oldstart = -1234,
               oldincr,
               old_i;
    int i;

    if ( (start != oldstart) || (incr != oldincr) ) old_i = -1 ;
    oldstart = start; oldincr = incr;
    old_i++;
    start += incr * old_i;
    for (i=old_i;i<max;i++, start += incr)
    {
        if ( TestBit(bitset, BitMapStartPoint[CurrentInput] + start)) continue;
        What1 = start / Y_02_Length[CurrentInput] ;
        What2 = start % Y_02_Length[CurrentInput] ;
        old_i = i;
        return 1;
    }
    oldstart = -1234;
    return 0;
}



/*+E
**
** Abstract : Function will randomly select a product from the current
**            input file, if there are no more selections left in the
**            current input then the next one is searched.
**
**            Currently we deal with two reaction sites (two points of
**            variability Y1 and Y2). A product is one of possible
**            combination of Y1,Y2.
**
**            The bit maps that are used to track the selections/eliminated
**            products are a vector of length numY01*numY02, where every
**            bit represents a product. Since we are dealing with multiple
**            slns then we just use one bitmap vector and string the
**            representations for a set of product together.
**
**            [ reaction1(csln1) products ][ reaction2(csln2) products ]
**
**            where each reaction products are layed out in a row major
**            format :
**
**            Y1(0)Y2(0), Y1(0)Y2(1) .. | Y1(1)Y2(0), Y1(1)Y2(1) ...
**
** Usage :
**
** Returns : 1 on success, 0 for failure.
**
** Algorithms : None.
**
** Revision History
**
**            Modified to work with multiple csln
**            processing and documented.
*/

int GrabRandom( p1, p2, fp)
int *p1, *p2, *fp;
{
    int index, sum;
    int i ;
    int value1, value2;
    unsigned char *p, *q, *pro;
    int byteOffset, bitsToSkip ;
    int Y_01_Offset , Y_02_Offset ;
    float f;
    /*
    ** Lets start at the begining portion of the bitmap for this input, our products
    ** are layed out like :
    **
    ** global variable CurrentInput tells us which reaction we are look at now
    ** and BitMapStartPoint[] has the starting points in the bitmap for each
    ** input
    */
    /*
    ** We know how many products are left in the current input, if we get to
    ** zero here we need to move on to the next input set.
    */
    for ( i = CurrentInput ; i < TotalInputs ; i++ )
        if ( RemainingInput[CurrentInput] > 0 )
            break;
        else
            CurrentInput++ ;
    if ( CurrentInput >= TotalInputs )
        return 0 ;
    /*
    ** Figure out which byte in the bitmap the products for this input start.
    **/
    byteOffset = BitMapStartPoint[CurrentInput] / 8 ;
    p = (unsigned char *) ( Dead_Products );
    p += byteOffset ;
    index = (( f = UTL_MATH_RAND()) * RemainingInput[CurrentInput]) + 1 ;
    value1 = sum = 0;

    while (sum < index)
    {
        sum += nbits[ ~(*p++) & 255];
        value1 += 8;
    }

    p -= 1; sum -= nbits[ ~(*p) & 255 ]; value1 -= 9; value2 = (~(*p) & 255);
    while (sum < index)
    {
        value1++;
        if (value2 & 1)
            sum++;
        value2 = value2 >> 1;
    }
    /*
    ** We found where our random (not selected) product is, now we need to go
    ** back so many bits to be able to translate the address in a one dimensional
    ** bitmap vector into a 2D index.
    ** (This is because our bitmap representation for this input did not start
    ** from 0 (or a byte boundary).
    */
    bitsToSkip = BitMapStartPoint[CurrentInput] - ( byteOffset * 8 ) ;

    value1 -= bitsToSkip ;
    What2 = ( value1 ) % Y_02_Length[CurrentInput];
    What1 = ( value1 ) / Y_02_Length[CurrentInput] ;
    *p1 = What1 ;
    *p2 = What2 ;
    /*
    ** Find out where the values for this product is.
    */
    Y_01_Offset = Y_02_Offset = 0 ;
    for ( i = 0 ; i < CurrentInput ; i++ )
    {
        Y_01_Offset += Y_01_Length[i] ;
        Y_02_Offset += Y_02_Length[i] ;
    }

    c_query = (*CalcFingerPrintFunc)(fp,
                                     Y_01[What1 + Y_01_Offset ],
                                     Y_02[What2 + Y_02_Offset ]);
#if 0
    pro = (unsigned char *) fp;
    p = (unsigned char *) ( Y_01[What1 + Y_01_Offset ] ) ;
    q = (unsigned char *) ( Y_02[What2 + Y_02_Offset ] ) ;
    /*
    ** Calculate the approximate fingure print by ORing the fingure print
    ** for the two pieces.
    */
    c_query = 0;
    for (index = 0;index <BytesPerFingerPrint;index++,pro++)
    {   *pro = *p++ | *q++ ;
        c_query += nbits[*pro&255]; }
#endif
    return 1;
}

int OutputThisHit( index1, index2)
int index1, index2;
{
    int which;

    which = index1 * Y_02_Length[CurrentInput] + index2;
    fprintf(OutputFile,"%s%d %d %d\n", InputNames[CurrentInput], which+1,
            index1+1,index2+1);
    which = BitMapStartPoint[CurrentInput] + index1*Y_02_Length[CurrentInput]
            +index2 ;
    FlagProduct(Good_Products,0,0, which);
    FlagProduct(Dead_Products,0,0, which); /* can only be selected once */
    /* note use of reagents; this is slightly wasteful of time */
    FlagReagent(Good_1, nY_01, index1);
    FlagReagent(Good_2, nY_02, index2);
    fflush(OutputFile);
    return 1;
}
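
GrabRandom picks a product that is still available by choosing a random rank k among the clear bits of Dead_Products and then locating the k-th clear bit in two phases: whole bytes are skipped using their clear-bit counts, then the final byte is walked bit by bit. The search can be sketched on its own as below. kth_clear_bit and clear_bits_in are hypothetical names, the bit order is assumed to be least significant bit first, and the random draw and the per-input start offset of the listing are left out.

```c
/* Number of clear (0) bits in one byte. */
static int clear_bits_in(unsigned char v)
{
    int i, n = 0;

    for (i = 0; i < 8; i++)
        if (!(v & (1u << i)))
            n++;
    return n;
}

/* Hypothetical helper: position of the k-th (1-based) clear bit in
 * `nbytes` bytes, counting bits LSB first within each byte.
 * Returns -1 if there are fewer than k clear bits or k < 1. */
int kth_clear_bit(const unsigned char *bytes, int nbytes, int k)
{
    int byte = 0, seen = 0, i;

    if (k < 1)
        return -1;
    /* Phase 1: skip whole bytes while the target lies beyond them. */
    while (byte < nbytes && seen + clear_bits_in(bytes[byte]) < k) {
        seen += clear_bits_in(bytes[byte]);
        byte++;
    }
    if (byte == nbytes)
        return -1;
    /* Phase 2: walk the bits of the byte that holds the target. */
    for (i = 0; i < 8; i++) {
        if (!(bytes[byte] & (1u << i)) && ++seen == k)
            return byte * 8 + i;
    }
    return -1;
}
```

In the listing, k corresponds to `index`, which is drawn as a random fraction of RemainingInput[CurrentInput] plus one; the returned position is then converted to the (What1, What2) pair by division and remainder with Y_02_Length[CurrentInput].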

int DumpBitSet(bitSet,offset,numY01,numY02)
int *bitSet;
int offset ;
int numY01 ;
int numY02 ;
{
    int i , j ;
    unsigned char *Products = (unsigned char *)bitSet ;
    int pos ;
    int byte ;
    int bit ;
    int index1 ;
    int index2 ;

    fprintf(stderr, "\n Y_02 \n");
    for ( i = 0 ; i < numY02 ; i++ )
        fprintf(stderr," %3d ",i+1);
    fprintf(stderr, "\n \n");
    for ( i = 0 ; i < (numY01 * numY02) ; i++ )
    {
        index1 = i / numY02 ;
        index2 = i % numY02 ;

        byte = ( i + offset ) / 8 ;
        bit = ( i + offset ) % 8 ;
        if (( ( i % numY02 )==0) )
            fprintf(stderr,"\n%3d |",index1+1);
        fprintf(stderr," %3d ",(Products[byte] & setbits[bit])?1:0 );
    }
    fprintf(stderr, "\n \n");
}



int DumpValues(inputSet,numY01,numY02,computeFunc)
int inputSet ;
int numY01 ;
int numY02 ;
int (*computeFunc)() ;
{
    int i , j ;
    int pos ;
    int byte ;
    int bit ;
    int index1 ;
    int index2 ;
    int onion ;
    int intsc ;
    double max ;
    int Y2_Offset ;
    int Y1_Offset ;

    fprintf(stderr, "\n Y_02 \n");
    for ( i = 0 ; i < numY02 ; i++ )
        fprintf(stderr, " %5d ", i+1);
    fprintf(stderr, "\n \n");

    for ( Y1_Offset = Y2_Offset = i = 0 ; i < inputSet ; i++ )
    {
        Y1_Offset += Y_01_Length[i] ;
        Y2_Offset += Y_02_Length[i] ;
    }
    for ( i = 0 ; i < (numY01 * numY02) ; i++ )
    {
        index1 = Y1_Offset + i / numY02 ;
        index2 = Y2_Offset + i % numY02 ;
        (*computeFunc)( index1, index2, &onion, &intsc, &max );
        if ( ( index2 - Y2_Offset ) == 0 )
            fprintf(stderr, "\n%5d | ", index1 + 1 - Y1_Offset );
        fprintf(stderr, " %0.3f ", max);
    }
    fprintf(stderr, "\n \n");

    DumpBitSet(Dead_Products, BitMapStartPoint[inputSet], numY01, numY02);
}

Appendix "P"

#include <stdio.h>
#include <signal.h>
#include <ctype.h>
#include <unistd.h>
#include <string.h>
#include <sys/stat.h>
#include <math.h>
#include "parseopt.h"
#include "utl_str.h"
#include "utl_mem.h"
#include "utl_file.h"
#include "utl_math.h"
#include "ct.h"
#include "ct_expr.h"
#include "ct_proto.h"
#include "import_proto.h"
#include "io_fprint.h"
#include "dservTypes.h"

/*
**+E:
**
** Abstract : Function zaps products that are in the same neighborhood
**            as the SLNs in the Unity hitlist files.
**
** Usage :
** Returns : 1 on success, 0 on error
**
** Algorithms : None.
**
** Revision History :
**
*/

int EliminateProductsFromDatabase(DatabaseNames,numEliminated,zapNeighbors)
char *DatabaseNames ;
int *numEliminated ;
int (*zapNeighbors)() ;
{
    int i ;
    char *cp ;
    struct IoDataBase *database = 0 ;
    struct IoFingerPrint *fingerPrint = 0 ;
    struct IoFingerPrintInfo fprintInfo ;
    struct IoFprintDef *fprintDef = 0 ;

    int fingerPrintFile ;
    int bytesPerFingerPrint ;
    long slnId ;
    char *databaseDirectory ;
    char *databaseName ;
    int numZapped = 0 ;
    int c_query ;

    *numEliminated = 0 ;
    if ( DatabaseNames )
    {
        cp = strtok(DatabaseNames, " ");
        while ( cp )
        {
            /*
            ** Open the database.
            */
            databaseDirectory = dirname( cp );
            databaseDirectory = databaseDirectory ? databaseDirectory : "." ;
            databaseName = basename(cp, (char *)0);

            if ( !( database = DB_IO__DBSTART_USER_PSWD(
                                   databaseDirectory,
                                   databaseName,
V, 

                                   NULL,
                                   NULL,
                                   0,
                                   0)))
                goto UnableToOpenDatabase ;

            if ( !(fprintDef = DB_IO_FPDEF_VREAD( database, "standard" )) )
                goto NoSuchScreen ;

            if ( !DB_IO_FPRINT_GETINFO( database,
                                        fprintDef->fpdFprintDir,
                                        fprintDef->fpdFprintFileName,
                                        &fprintInfo ) )
                goto UnableToGetScreenInfo ;
            if ( !(fingerPrintFile = DB_IO_FPRINT_OPEN( database,
                                        fprintDef->fpdFprintDir,
                                        fprintDef->fpdFprintFileName )) )
                goto NoSuchScreen ;

            bytesPerFingerPrint = fprintInfo.fpFingerLength / 8 ;
            fprintf(stderr, "Processing database %s %d\n", cp, bytesPerFingerPrint);
            if ( (fprintInfo.fpFingerLength % 8) )
                bytesPerFingerPrint++ ;

            fingerPrint = (struct IoFingerPrint *) UTL_MEM_ALLOC(
                              sizeof(*fingerPrint) + bytesPerFingerPrint + 1 );

            /*
            ** Read all compounds in the database.
            */
            for ( slnId = fprintInfo.fpStartSlnNo ;
                  slnId <= fprintInfo.fpLastSlnNo ;
                  slnId++ )
            {
                if ( !DB_IO_FPRINT_READ( database,
                                         fingerPrintFile,
                                         (long) slnId,
                                         fingerPrint ) )
                    goto UnableToReadFromDatabase ;
                /*
                ** Zap all the neighbors in the current run.
                */
                c_query = CountFingerPrintBits(&fingerPrint->fpPrint,
                                               bytesPerFingerPrint);
                fprintf(stderr, "Reading sln %d\r", slnId++);
                (*zapNeighbors)(&fingerPrint->fpPrint, c_query, &numZapped, 0, -1, -1);
                (*numEliminated) += numZapped ;
            }

            /*
            ** CLOSING DATABASES TRASHES MEMORY. DO THIS AFTER YOU CAN SPEND SOME
            ** TIME TO DEBUG THIS PROBLEM.
            ** F.S. 05-14-96
            */
#if 0
            /*
            ** Close the database and do it again!
            */
            DB_IO_DBCLOSE(database);
            /*
            ** Get the next database to process.
            */
            if ( fingerPrint )
                UTL_MEM_FREE( fingerPrint );
            DB_IO_FPRINT_CLOSE( database, fingerPrintFile );
#endif
            cp = strtok(NULL, " ");
        }
    }
    fprintf(stderr, "\n");

    return 1 ;

UnableToOpenDatabase :
    fprintf(stderr, "Unable to open database %s\n", cp);
    goto Error ;
NoSuchScreen :
    fprintf(stderr, "Unable to open screen 'standard'\n");
    goto Error ;
UnableToGetScreenInfo :
    fprintf(stderr, "Unable to read screen information\n");
    goto Error ;
UnableToReadFromDatabase :
    fprintf(stderr, "Unable to read fingerprint for sln id %d\n", slnId);
    goto Error ;
Error :
    return 0 ;
}
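The fingerprint buffer size above is obtained by dividing the bit length by 8 and adding one byte when there is a remainder. A minimal sketch of the same rounding, with a hypothetical helper name `bytes_for_bits` that is not part of the appendix, is:

```c
#include <assert.h>

/* Number of whole bytes needed to hold num_bits bits.
 * Equivalent to num_bits / 8, plus one when num_bits % 8 is nonzero. */
static int bytes_for_bits(int num_bits)
{
    assert(num_bits >= 0);
    return (num_bits + 7) / 8;
}
```

For example, a 988-bit fingerprint would need 124 bytes under this rule.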

/*
**+E:
**
** Abstract : Function zaps products that are in the same neighborhood
**            as the SLNs in the Unity hitlist files.
**
** Usage :
**
** Returns : 1 on success, 0 on error
**
** Algorithms : None.
**
**-E:
*/
int EliminateProductsFromHitlist(HitlistNames,screenName,numEliminated,zapNeighbors)
char *HitlistNames ;
char *screenName ;
int *numEliminated ;
int (*zapNeighbors)() ;
{
    int i ;
    char *cp ;
    struct IoFingerPrint *fingerPrint = 0 ;
    FILE *fingerPrintFile = 0 ;
    int bytesPerFingerPrint ;
    long slnId ;
    char *databaseDirectory ;
    char *databaseName ;
    int numZapped = 0 ;
    struct CtConnectionTable *ct ;
    int nBitsSet ;
    char *sln ;
    FILE *handle ;
    int c_query ;
    int *ScreenStructure ;

    *numEliminated = 0 ;
    if ( !HitlistNames )
        return 1 ;

    /*
    ** Read in the screen information first.
    */
    if ( !(fingerPrintFile = UTL_FILE_FOPEN(screenName, "r")) )
        goto UnableToOpenFingureprintFile ;
    ScreenStructure = (int *) DB_BIT2_PARSE_2DSCREEN(fingerPrintFile);
    UTL_FILE_FCLOSE(fingerPrintFile);
    if ( !ScreenStructure )
        goto UnableToReadScreenStructure ;
    bytesPerFingerPrint = DB_BIT2_GET_SIZE( ScreenStructure );
    fingerPrint = (struct IoFingerPrint *) UTL_MEM_ALLOC( bytesPerFingerPrint );
    *numEliminated = 0 ;

    if ( HitlistNames )
    {
        cp = strtok(HitlistNames, " ");
        while ( cp )
        {
            slnId = 0 ;
            fprintf(stderr, "Processing hitlist %s\n", cp);
            /*
            ** Open the database.
            */
            if ( !(handle = fopen(cp, "r")) )
                goto UanbleToOpenHitlist ;
            /*
            ** Read all the hits in the hitlist.
            */

            while ( UTL_SCAN_GETS( (FILE *) handle, "\n", &sln ) != -1 )

            {
                if ( !(ct = DB_IMPORT_SLN(sln)) )
                    goto UnableToGetCtFromSln ;
                memset( fingerPrint, 0, bytesPerFingerPrint );
                if ( !DB_BIT2_EVALUATE( ct,
                                        ScreenStructure,
                                        fingerPrint,
                                        &nBitsSet ) )
                    goto UnableToGenerateFingerprint ;
                /*
                ** Zap all the neighbors in the current run.
                */
                c_query = 0 ;
                c_query = CountFingerPrintBits(&fingerPrint->fpPrint, bytesPerFingerPrint);
                fprintf(stderr, "Reading sln %d\r", slnId++);
                (*zapNeighbors)(&fingerPrint->fpPrint, c_query, &numZapped, 0, -1, -1);
                (*numEliminated) += numZapped ;
            }

            /*
            ** Close the database and do it again!
            */
            fclose(handle);
            /*
            ** Get the next hitlist to process.
            */
            cp = strtok(NULL, " ");
        }
    }
    fprintf(stderr, "\n");

    if ( fingerPrint )
        UTL_MEM_FREE( fingerPrint );

    return 1 ;
UnableToOpenFingureprintFile :
    fprintf(stderr, "Unable to open fingerprint file %s\n", screenName);
    goto Error ;
UnableToReadScreenStructure :
    fprintf(stderr, "Unable to read screen info for %s\n", screenName);
    goto Error ;
UanbleToOpenHitlist :
    fprintf(stderr, "Unable to open hitlist %s\n", cp);
    goto Error ;
UnableToGenerateFingerprint :
    fprintf(stderr, "Unable to generate fingerprint for \n%s\n", sln);
    goto Error ;
UnableToGetCtFromSln :
    fprintf(stderr, "Unable to generate ct for \n%s\n", sln);
    goto Error ;
Error :
    return 0 ;
}

Appendix "Q"
FILTER

#include <stdio.h>
#include <signal.h>
#include <ctype.h>
#include <unistd.h>
#include <string.h>
#include <sys/stat.h>
#include <math.h>
#include "parseopt.h"
#include "utl_str.h"
#include "utl_mem.h"
#include "utl_file.h"
#include "utl_math.h"
#include "ct.h"
#include "ct_expr.h"
#include "ct_proto.h"
#include "import_proto.h"
#include "io_fprint.h"
#include "dservTypes.h"



/*
**+E:
**
** Abstract : Function parses range field string for ADS design programs.
**            It takes a string of the form
**            "logp -1.0 8.0 MW 200 500 price 6 12.50" and fills in the
**            global array RangeFields.
**
** Usage :
**
** Returns : 1 on success, 0 for failure.
**
** Algorithms : None.
**
**-E:
*/

int ParseRangeVar(rangeVar,numRangeFieldsAllocated,numRangeFields,rangeFields)
char *rangeVar ;
int *numRangeFieldsAllocated ;
int *numRangeFields ;
struct RangeStruct **rangeFields ;
{
    static int stat = 0 ;
    char *buffer = (char *)NULL ;
    char *name ;
    char *low ;
    char *high ;
    int i ;

    *numRangeFieldsAllocated = 0 ;
    *numRangeFields = 0 ;
    *rangeFields = (struct RangeStruct *)NULL ;

    if ( !(buffer = UTL_STR_SAVE(rangeVar)) )
        goto Failure ;

    name = strtok(buffer, " ");
    while ( name )
    {
        if ( !(low = strtok(NULL, " ")) )
            goto UnableToParse ;
        if ( !(high = strtok(NULL, " ")) )
            goto UnableToParse ;
        if ( *numRangeFields >= *numRangeFieldsAllocated )
        {
            if ( !*rangeFields )
            {
                if ( !(*rangeFields = (struct RangeStruct *)UTL_MEM_CALLOC(
                                          ALLOCATE_INCREMENT,
                                          sizeof(struct RangeStruct))) )
                    goto Failure ;
                else
                    *numRangeFieldsAllocated = ALLOCATE_INCREMENT ;
            }
            else
            {
                if ( !(*rangeFields = (struct RangeStruct *)UTL_MEM_RECALLOC(
                          *rangeFields,
                          (*numRangeFieldsAllocated * sizeof(struct RangeStruct)),
                          ((*numRangeFieldsAllocated + ALLOCATE_INCREMENT) *
                           sizeof(struct RangeStruct)) )) )
                    goto Failure ;
                else
                    *numRangeFieldsAllocated += ALLOCATE_INCREMENT ;
            }
        }
        (*rangeFields)[*numRangeFields].RangeFieldName = UTL_STR_SAVE(name);
        (*rangeFields)[*numRangeFields].lowValue = atof(low);
        (*rangeFields)[*numRangeFields].highValue = atof(high);
        (*numRangeFields)++ ;
        name = strtok(NULL, " ");
    }

    stat = 1 ;
    goto CleanUp ;
UnableToParse :
    fprintf(stderr, "Unable to parse -rangevar %s\n", rangeVar);
    stat = 0 ;
    goto CleanUp ;
Failure :
    stat = 0 ;
    goto CleanUp ;
CleanUp :
    if ( buffer )
        UTL_MEM_FREE(buffer);
    return stat ;
}
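The name/low/high parsing performed by ParseRangeVar can be illustrated with standard library calls only. The sketch below is not the appendix code: it assumes a fixed-capacity output array instead of the UTL_MEM allocation routines, and `struct Range`, `parse_ranges`, and `MAX_RANGES` are hypothetical names chosen for the example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_RANGES 8          /* illustrative fixed capacity */

struct Range {                /* hypothetical stand-in for RangeStruct */
    char   name[32];
    double low;
    double high;
};

/* Parse "name low high name low high ..." into out[].
 * Returns the number of ranges parsed, or -1 on a malformed string. */
static int parse_ranges(const char *spec, struct Range *out, int max_out)
{
    char buffer[256];
    char *name, *low, *high;
    int count = 0;

    assert(spec != NULL && out != NULL);
    if (strlen(spec) >= sizeof buffer)
        return -1;
    strcpy(buffer, spec);                 /* strtok modifies its input */

    for (name = strtok(buffer, " "); name; name = strtok(NULL, " ")) {
        low  = strtok(NULL, " ");
        high = strtok(NULL, " ");
        if (!low || !high || count >= max_out)
            return -1;                    /* incomplete triple or no room */
        strncpy(out[count].name, name, sizeof out[count].name - 1);
        out[count].name[sizeof out[count].name - 1] = '\0';
        out[count].low  = atof(low);
        out[count].high = atof(high);
        count++;
    }
    return count;
}
```

With the string quoted in the header comment, this sketch would yield three ranges: logp, MW, and price.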

/*
**+E:
**
** Abstract : Function parses one of field string for ADS design programs.
**            It takes a string of the form
**            "supplier Aldrich,Sigma,Fluka,SALOR taste SWEET,Salty"
**            and fills in the global array OneOfValues.
**
** Usage :
**
** Returns : 1 on success, 0 for failure.
**
** Algorithms : None.
**
*/

int ParseOneOfVar(oneOfVar,numOneOfFieldsAllocated,numOneOfFields,oneOfValues)
char *oneOfVar ;
int *numOneOfFieldsAllocated ;
int *numOneOfFields ;
struct OneOfStruct **oneOfValues ;
{
    static int stat = 0 ;
    char *buffer = (char *)NULL ;
    char *name ;
    char *choices ;
    char *choice ;
    int i ;
    int j ;
    char *cp ;
    char *end ;

    *numOneOfFieldsAllocated = 0 ;
    *numOneOfFields = 0 ;
    (*oneOfValues) = (struct OneOfStruct *)NULL ;

    if ( !(buffer = UTL_STR_SAVE(oneOfVar)) )
        goto Failure ;
    /*
    ** Start off by reading the field name.
    */
    name = strtok(buffer, " ");
    while ( name )
    {
        if ( *numOneOfFields >= *numOneOfFieldsAllocated )
        {
            if ( !(*oneOfValues) )
            {
                if ( !(*oneOfValues = (struct OneOfStruct *)UTL_MEM_CALLOC(
                                          ALLOCATE_INCREMENT,
                                          sizeof(struct OneOfStruct))) )
                    goto Failure ;
                else
                    *numOneOfFieldsAllocated = ALLOCATE_INCREMENT ;
            }
            else
            {
                if ( !(*oneOfValues = (struct OneOfStruct *)UTL_MEM_RECALLOC(
                          *oneOfValues,
                          (*numOneOfFieldsAllocated * sizeof(struct OneOfStruct)),
                          ((*numOneOfFieldsAllocated + ALLOCATE_INCREMENT) *
                           sizeof(struct OneOfStruct)) )) )
                    goto Failure ;
                else
                    *numOneOfFieldsAllocated += ALLOCATE_INCREMENT ;
            }
        }
        (*oneOfValues)[*numOneOfFields].OneOfFieldName = UTL_STR_SAVE(name);
        (*oneOfValues)[*numOneOfFields].numValues = 0 ;
        (*oneOfValues)[*numOneOfFields].numValuesAlloc = ALLOCATE_INCREMENT ;
        if ( !((*oneOfValues)[*numOneOfFields].values = (char **)
                  UTL_MEM_CALLOC(ALLOCATE_INCREMENT, sizeof(char *)) ) )
            goto Failure ;
        /*
        ** Now look at the choices this field could have.
        */
        choices = strtok(NULL, " ");
        if ( !choices )
            goto UnableToParse ;
        choice = strtok(choices, ",");
        while ( choice )
        {
            if ( (*oneOfValues)[*numOneOfFields].numValues >=
                 (*oneOfValues)[*numOneOfFields].numValuesAlloc )
            {
                if ( !((*oneOfValues)[*numOneOfFields].values = (char **)
                        UTL_MEM_RECALLOC((*oneOfValues)[*numOneOfFields].values,
                          ( (*oneOfValues)[*numOneOfFields].numValuesAlloc *
                            sizeof(char *)),
                          ( ((*oneOfValues)[*numOneOfFields].numValuesAlloc +
                             ALLOCATE_INCREMENT)
                            * sizeof(char *)) ) ) )
                    goto Failure ;
                (*oneOfValues)[*numOneOfFields].numValuesAlloc += ALLOCATE_INCREMENT ;
            }
            (*oneOfValues)[*numOneOfFields].values[(*oneOfValues)[*numOneOfFields].numValues]
                = UTL_STR_SAVE(choice);
            (*oneOfValues)[*numOneOfFields].numValues++ ;
            end = choice + strlen(choice) + 1 ;
            choice = strtok(NULL, ",");
        }
        (*numOneOfFields)++ ;
        name = strtok(end, " ");
    }

    stat = 1 ;
    goto CleanUp ;
UnableToParse :
    fprintf(stderr, "Unable to parse -oneof %s\n", oneOfVar);
    stat = 0 ;
    goto CleanUp ;
Failure :
    stat = 0 ;
    goto CleanUp ;
CleanUp :
    if ( buffer )
        UTL_MEM_FREE(buffer);
    return stat ;
}
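ParseOneOfVar tokenizes on two levels: blank-separated fields, each followed by a comma-separated list of choices, restarting strtok at a saved `end` pointer. The sketch below shows the same two-level split using `strspn` and `strcspn` rather than restarting `strtok`; that substitution, the fixed-size arrays, and the names `struct OneOf`, `copy_token`, and `parse_one_of` are assumptions made for illustration and are not taken from the appendix.

```c
#include <assert.h>
#include <string.h>

#define MAX_CHOICES 8

struct OneOf {                       /* hypothetical stand-in for OneOfStruct */
    char name[32];
    char choices[MAX_CHOICES][32];
    int  num_choices;
};

/* Copy a token of length len into dst, truncating to dst_size - 1. */
static void copy_token(char *dst, size_t dst_size, const char *src, size_t len)
{
    if (len >= dst_size)
        len = dst_size - 1;
    memcpy(dst, src, len);
    dst[len] = '\0';
}

/* Parse "name c1,c2,... name c1,c2,..." into out[].
 * Returns the number of fields parsed, or -1 on a malformed string. */
static int parse_one_of(const char *spec, struct OneOf *out, int max_out)
{
    const char *p = spec;
    int count = 0;

    assert(spec != NULL && out != NULL);
    for (;;) {
        size_t len;

        p += strspn(p, " ");                 /* skip blanks before name */
        if (*p == '\0')
            return count;
        if (count >= max_out)
            return -1;
        len = strcspn(p, " ");               /* field name */
        copy_token(out[count].name, sizeof out[count].name, p, len);
        p += len;
        p += strspn(p, " ");
        if (*p == '\0')
            return -1;                       /* name without choices */

        out[count].num_choices = 0;
        while (*p != '\0' && *p != ' ') {    /* comma-separated choices */
            len = strcspn(p, ", ");
            if (len > 0) {
                if (out[count].num_choices >= MAX_CHOICES)
                    return -1;
                copy_token(out[count].choices[out[count].num_choices],
                           sizeof out[count].choices[0], p, len);
                out[count].num_choices++;
            }
            p += len;
            if (*p == ',')
                p++;
        }
        count++;
    }
}
```

Scanning with explicit pointers avoids depending on the internal state of `strtok` when switching between the two delimiter sets.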
/*
**+E:
**
** Abstract : Function parses a line from the input file and extracts
**            out any rangevar or oneof fields.
**
** Usage :
** Returns : Always returns 1 ;
**
** Algorithms : None.
**
**-E:
*/

int ReadLineAttributes(line,numRangeFields,rangeValues,rangeFields,numOneOfFields,
                       oneOfValues,oneOfFields)
char *line ;
int numRangeFields ;
float **rangeValues ;
struct RangeStruct *rangeFields ;
int numOneOfFields ;
int **oneOfValues ;
struct OneOfStruct *oneOfFields ;
{
    int i ;
    int j ;
    char *cp ;

    /*
    ** Now read in the scalar selection fields if any.
    */
    if ( numRangeFields )
    {
        if ( !(*rangeValues = (float *)UTL_MEM_CALLOC(numRangeFields,
                                                      sizeof(float)) ) )
            return 0 ;
    }
    if ( numOneOfFields )
    {
        if ( !(*oneOfValues = (int *)UTL_MEM_CALLOC(numOneOfFields,
                                                    sizeof(int)) ) )
            return 0 ;
    }
    for ( i = 0 ; i < numRangeFields ; i++ )
    {
        if ( ( cp = strstr(line, rangeFields[i].RangeFieldName ) ) )
        {
            /*
            ** Move past the logp= to get the value of this field, if the value is
            ** a ';' then it is a missing value.
            */
            cp = cp + strlen(rangeFields[i].RangeFieldName) + 1 ;
            if ( *cp == ';' )
                (*rangeValues)[i] = MISSING_FLOAT_VALUE ;
            else
                (*rangeValues)[i] = atof(cp);
        }
        else
        {
            (*rangeValues)[i] = MISSING_FLOAT_VALUE ;
        }
    }
    /*
    ** Parse the -oneof field, we are looking for something looking like
    ** "supplier=Aldrich"
    */
    for ( i = 0 ; i < numOneOfFields ; i++ )
    {
        if ( ( cp = strstr(line, oneOfFields[i].OneOfFieldName ) ) )
        {
            cp = cp + strlen(oneOfFields[i].OneOfFieldName) + 1 ;
            if ( *cp == ';' )
                (*oneOfValues)[i] = MISSING_INT_VALUE ;
            else
            {
                for ( j = 0 ; j < oneOfFields[i].numValues ; j++ )
                {
                    if ( UTL_STR_NCMP_NOCASE(cp,
                             oneOfFields[i].values[j],
                             strlen(oneOfFields[i].values[j])) == 0 )
                    {
                        (*oneOfValues)[i] = j ;
                        break ;
                    }
                }
                if ( j == oneOfFields[i].numValues )
                    (*oneOfValues)[i] = NOT_A_MATCH_VALUE ;
            }
        }
        else
            (*oneOfValues)[i] = MISSING_INT_VALUE ;
    }
    return 1 ;
}

/*
**+E:
**
** Abstract : Function checks to see if the given product passes the
**            user supplied filters.
**
** Usage :
**
** Returns : 1 if the product is not within range, 0 otherwise.
** Algorithms : None.
**
*/

static int
NotWithinScalarRange(firstIndex,secondIndex,numRangeFields,rangeValues_Y01,rangeValues_Y02,
                     rangeFields,numOneOfFields,oneOfValues_Y01,oneOfValues_Y02,oneOfValues)
int firstIndex ;    /* Index into Y_01 data */
int secondIndex ;   /* Index into Y_02 data */
int numRangeFields ;
float **rangeValues_Y01 ;
float **rangeValues_Y02 ;
struct RangeStruct *rangeFields ;
int numOneOfFields ;
int **oneOfValues_Y01 ;
int **oneOfValues_Y02 ;
struct OneOfStruct *oneOfValues ;
{
    int i ;
    float total ;

    /*
    ** First check the range values.
    */
    for ( i = 0 ; i < numRangeFields ; i++ )
    {
        /*
        ** If one of the regions has a missing value, then we do not filter this
        ** product.
        */
        if ( (( rangeValues_Y01[firstIndex][i] - MISSING_FLOAT_VALUE )
                  == SMALL_FLOAT) ||
             (( rangeValues_Y02[secondIndex][i] - MISSING_FLOAT_VALUE )
                  == SMALL_FLOAT) )
            return 0 ;

        total = rangeValues_Y01[firstIndex][i] + rangeValues_Y02[secondIndex][i] ;
        if ( (total > rangeFields[i].highValue) ||
             (total < rangeFields[i].lowValue) )
        {
            return 1 ;
        }
    }
    for ( i = 0 ; i < numOneOfFields ; i++ )
    {
        /*
        ** If the value is missing then we dont mess with this guy.
        */
        if ( ( oneOfValues_Y01[firstIndex][i] == MISSING_INT_VALUE ) ||
             ( oneOfValues_Y02[secondIndex][i] == MISSING_INT_VALUE ) )
            return 0 ;
        /*
        ** If any of the regions in the product does not match the selection
        ** criteria, then the product is rejected.
        */
        if ( ( oneOfValues_Y01[firstIndex][i] == NOT_A_MATCH_VALUE ) ||
             ( oneOfValues_Y02[secondIndex][i] == NOT_A_MATCH_VALUE ) )
            return 1 ;
    }
    return 0 ;
}
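The range test above sums the property values of the two pieces and rejects the product when the total falls outside the requested interval, while a missing value never causes rejection. A single-field sketch of that rule follows; the sentinel `MISSING_VALUE` and the function name `outside_range` are hypothetical, and the exact-equality test on the sentinel is a simplification of the comparison used in the appendix.

```c
#include <assert.h>

#define MISSING_VALUE (-9999.0f)   /* hypothetical sentinel for "no data" */

/* Returns 1 when the summed property of two pieces falls outside
 * [low, high]; returns 0 when it is in range or either value is missing. */
static int outside_range(float a, float b, float low, float high)
{
    float total;

    assert(low <= high);
    if (a == MISSING_VALUE || b == MISSING_VALUE)
        return 0;                  /* missing data never rejects a product */
    total = a + b;
    return (total > high) || (total < low);
}
```

For a molecular weight window of 200 to 500, pieces of 100 and 150 would pass, while pieces of 50 and 100 would be rejected.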



/*
**+E:
**
** Abstract : Function zaps products that are not within the user supplied
**            selection criteria.
**
** Usage :
**
** Returns : 1 on success or 0 on failure.
**
** Algorithms : None.
**
*/

int FilterProducts(inputInfo,rangeData,oneOfData,numFiltered)
struct InputInfoStruct *inputInfo ;
struct RangeInfoStruct *rangeData ;
struct OneOfInfoStruct *oneOfData ;
int *numFiltered ;
{
    int numProducts ;
    int i ;
    int index1 ;
    int index2 ;
    int Y01_Offset = 0 ;
    int Y02_Offset = 0 ;
    int k ;

    *numFiltered = 0 ;

    for ( k = 0 ; k < inputInfo->totalInputs ; k++ )
    {
        numProducts = inputInfo->Y_01_Length[k] * inputInfo->Y_02_Length[k] ;
        for ( i = 0 ; i < numProducts ; i++ )
        {
            index1 = i / inputInfo->Y_02_Length[k] ;  /* Y_01 index */
            index2 = i % inputInfo->Y_02_Length[k] ;  /* Y_02 index */
            if ( NotWithinScalarRange(index1,
                     index2,
                     rangeData->numRangeFields,
                     rangeData->rangeValues_Y01 + Y01_Offset,
                     rangeData->rangeValues_Y02 + Y02_Offset,
                     rangeData->rangeFields,
                     oneOfData->numOneOfFields,
                     oneOfData->oneOfValues_Y01 + Y01_Offset,
                     oneOfData->oneOfValues_Y02 + Y02_Offset,
                     oneOfData->oneOfFields ) )
            {
                ZapInputProduct(k, i, 0.0);
                /* FlagProduct(dead_Products,0,0,i + bitMapStartPoint[k] ); */
                /* someLeft-- ;
                   remainingInput[k]-- ; */
                *numFiltered += 1 ;
            }
        }
        Y01_Offset += inputInfo->Y_01_Length[k] ;
        Y02_Offset += inputInfo->Y_02_Length[k] ;
    }
    return 1 ;
}

Appendix "R"

/*+M: dbcsln_bitset.c
**
** Author            Date      Description
** Fred Soltanshahi  07-30-96  Routines to access a ChemSpace
**                             bitset.
** Entry Points :
**   CS_PRDCT_BITSET_OPEN() - Opens a CS bitset file.
**   CS_PRDCT_BITSET_CLOSE() - Closes and cleans up after a bitset file.
**   CS_PRDCT_BITSET_WRITE() - Writes a bitset file.
**   CS_PRDCT_BITSET_CREATE() - Creates a bitset in memory.
**   CS_PRDCT_BITSET_DUMP() - Dumps the content of the file.
**   CS_PRDCT_BITSET_GETHITS - Returns the indexes for the requested
**                             number of hits.
**   CS_PRDCT_BITSET_SETBITS() - Copies a raw bitset into ChemSpace
**                             compressed bitset.
**   CS_PRDCT_BITSET_INDEXES_TO_INDEX - get bitset index
**                             Y_01, Y_02 etc.
**   CS_PRDCT_BITSET_TO_RAW() - Copies a ChemSpace bitset back into
**                             a raw bitset
**   CS_PRDCT_BITSET_CONCAT_RAW() - Copies a ChemSpace bitset
**                             into part of a raw bitset
**   CS_PRDCT_BITSET_SELECTED() - returns totalSelected
**   CS_PRDCT_BITSET_REVEAL() - Returns elements of hidden bitset
**                             data structure to external program.
**   CS_PRDCT_BITSET_SET_PRD_BIT() - Sets a bit in ChemSpace bitset.
**   CS_PRDCT_BITSET_GET_STATS() - Returns totals for the bitset.
**   CS_PRDCT_BITSET_CORE_INFO() - Gets core information from a bitset
**                             file.
**   CS_PRDCT_MSTR_CORE_INFO() - Gets core information from a master file.
**   CS_PRDCT_BITSET_GETHITS() - Gets the indexes for the requested set of
**                             hits.
**   CS_PRDCT_BITSET_CREATE_BIT_STRING() - Creates a compressed
**                             bitstring in memory
**   CS_PRDCT_BITSET_DESTROY_BIT_STRING() - Deletes a compressed bitset
**                             in memory.
**
*/

#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <math.h>
#include "utl_str.h"
#include "utl_mem.h"

extern char *GetFullPathName() ;
extern char *basename() ;

#define VERSION_NUMBER 1.001
#define ALLOCATION_FACTOR 1.25 /* Extra room in each variation site for growth */
#define MAX_VARIATION_SITES 255

extern int setbits[8] ;
extern int nbits[256] ;

typedef struct MasterFileStruct
{
    char *masterFilePathName ;
    int masterRecNo ;
    char *prefixForFiles ;
    int numVariationSites ;
    int numberOfMissingBits ;
    int Ibits ;
5 diar ^»IefilePathNalne; 

    int startCore ;
    char *fingerFileName ;
    int fingerOffset ;
    char **x_FileName ;
    char **reagentInfo ;
} MasterFileStruct ;
typedef struct ProgramInfoStruct
{
    char *programName ;
    int bufferSize ;
    int *buffer ;
} ProgramInfoStruct ;
typedef struct BitSetFileStruct
{
    struct MasterFileStruct masterFileInfo ;
    struct ProgramInfoStruct programInfo ;
    int numVariationSites ;
    int *actuallSizes ;
    int *allocSizes ;
    int *numFragsInEachSite ;
    int totalSelected ;
    int *bitset ;
    int firstHitAddress ;
    unsigned char **fragmentBitset ;
} BitSetFileStruct ;
typedef struct
{
    struct BitSetFileStruct *bitset ;
    int numVariations ;
    int *choices ;
    void *call_udata ;   /* callback callback's udata */

int (^^qnr)(void *udata, int num Variants, int ^choices ); 
    int totalExcluded ;
} BIT_TRACKING ;

static int RangeCallback ( void *udata, int startRange, int endRange );

static int BitSetAddressToIndexes(struct BitSetFileStruct *bitset, int address, int **indxs,
                                  int *notUsedFlag)
{
    int numVariations ;
    int *allPtr ;   /* alloced Sizes Ptr. */
    int *actPtr ;   /* actual sizes Ptr */
    int *choices ;
    int *chPtr ;
    int X ;
    int i ;
    int skip = 1 ;
    int skipcnt = 0 ;

    numVariations = bitset->numVariationSites ;
    X = address ;
    /*
    ** If the caller did not give us space to put in the indexes, allocate our
    ** own.
    */
    if ( !(*indxs) )
        (*indxs) = (int *) UTL_MEM_CALLOC(numVariations, sizeof(int) );
    choices = *indxs ;
    if ( notUsedFlag )
        *notUsedFlag = 0 ;
    for ( allPtr = bitset->allocSizes + (numVariations - 1),
          actPtr = bitset->actuallSizes + (numVariations - 1),
          chPtr = choices + (numVariations - 1),
          i = numVariations ;
          i-- ;
          allPtr--, actPtr--, chPtr-- )
    {
        *chPtr = X % *allPtr ;
        if ( *chPtr >= *actPtr )
        {
            if ( !skipcnt )
            {
                skipcnt = 1 ;
                skip *= ( *allPtr - *chPtr );
            }
            /*
            ** If we are rejecting things out of bounds of the actual sizes then we
            ** are out of here.
            */
            if ( notUsedFlag )
            {
                *notUsedFlag = 1 ;
                break ;
            }
        }
        X = X / *allPtr ;
        if ( !skipcnt )
            skip *= *allPtr ;
    }
    if ( !skipcnt )
        skip = 0 ;
    return skip ;
}

static int TestDump(DumpCode, Parameter, DumpInt)
int DumpCode ;
void *Parameter ;
int DumpInt ;
{
    FILE *File = (FILE *)Parameter ;

    fprintf(File, "%d\n", DumpInt);
}
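BitSetAddressToIndexes, listed before TestDump, treats a bitset address as a mixed-radix number whose digits are the per-site choices, with the last variation site varying fastest. The sketch below shows only that decoding step and the out-of-bounds check; it omits the skip count computed in the appendix, and the function name `address_to_indexes` and its parameter names are hypothetical.

```c
#include <assert.h>

/* Decode a row-major address into one index per variation site.
 * alloc_sizes[] gives the allocated extent of each dimension and
 * actual_sizes[] the number of entries really in use.
 * Returns 1 if every index is within the actual sizes, 0 if the
 * address falls in the padding of some dimension. */
static int address_to_indexes(int address, int num_sites,
                              const int *alloc_sizes,
                              const int *actual_sizes,
                              int *indexes)
{
    int used = 1;
    int i;

    assert(address >= 0 && num_sites > 0);
    for (i = num_sites - 1; i >= 0; i--) {      /* last site varies fastest */
        indexes[i] = address % alloc_sizes[i];
        if (indexes[i] >= actual_sizes[i])
            used = 0;                           /* lands in unused padding */
        address /= alloc_sizes[i];
    }
    return used;
}
```

With allocated sizes {4, 5} and actual sizes {3, 4}, address 7 decodes to indexes (1, 2), while address 9 decodes to column 4, which lies in the padding.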



static int CalculateAllocatedSizes(int numSites, int *sizes, int *allocSizes)
{
    int i ;

    for ( i = 0 ; i < numSites ; i++ )
        allocSizes[i] = sizes[i] * ALLOCATION_FACTOR ;
}

static int Init()
{
    static int firstTime = 1 ;
    int i ;

    if ( firstTime )
    {
        for ( i = 0 ; i < 8 ; i++ ) setbits[i] = ( 1 << i ) & 255 ;
        firstTime = 0 ;
        for ( i = 0 ; i < 256 ; i++ ) nbits[i] = (i&1) + (i&2)/2 + (i&4)/4 + (i&8)/8 +
                                                 (i&16)/16 + (i&32)/32
                                                 + (i&64)/64 + (i&128)/128 ;
    }
}

int setbits_nbits_Init()
{
    Init();
    return 1 ;
}
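The two lookup tables built by Init() support the fingerprint operations used throughout these appendices: testing a single bit and counting the bits set in a byte. The sketch below rebuilds equivalent tables and uses them to OR two fingerprints and count the bits of the result. The names `bit_mask`, `bit_count`, `init_tables`, and `or_and_count` are hypothetical, and the tables are declared locally rather than as the extern arrays of the listing.

```c
#include <assert.h>

static unsigned char bit_mask[8];     /* bit_mask[i] has only bit i set */
static unsigned char bit_count[256];  /* bit_count[b] = number of 1 bits in b */

static void init_tables(void)
{
    int i, b;

    for (i = 0; i < 8; i++)
        bit_mask[i] = (unsigned char)(1u << i);
    for (b = 0; b < 256; b++) {
        bit_count[b] = 0;
        for (i = 0; i < 8; i++)
            if (b & bit_mask[i])
                bit_count[b]++;
    }
}

/* OR two fingerprints into out[] and return the number of bits set
 * in the result. init_tables() must have been called first. */
static int or_and_count(const unsigned char *p, const unsigned char *q,
                        unsigned char *out, int num_bytes)
{
    int i, total = 0;

    assert(num_bytes >= 0);
    for (i = 0; i < num_bytes; i++) {
        out[i] = p[i] | q[i];
        total += bit_count[out[i]];
    }
    return total;
}
```

A 256-entry table trades a small amount of memory for a per-byte count that needs no loop over individual bits.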



int *bitset ;
int offset ;
int numVariations ;
int *sizes ;
int *allocSizes ;
{
    void *compressed ;
    int byte ;
    int bit ;
    int size ;
    int total ;
    int i ;
    int index1 ;
    int index2 ;
    int newPos ;
    int rowLength ;
    unsigned char *bs = (unsigned char *)bitset ;

    Init();

#if 0
    /*
    ** Always calculate the allocated sizes, we are the only one who can set this.
    */
    if ( allocSizes[0] <= 0 )
        CalculateAllocatedSizes(numVariations,sizes,allocSizes);
#endif
    /*
    ** Make the allocated size the same as the real size.
    */
    for ( i = 0 ; i < numVariations ; i++ )
        allocSizes[i] = sizes[i] ;
    size = allocSizes[0] ;
    for ( i = 1 ; i < numVariations ; i++ )
        size *= allocSizes[i] ;
    total = sizes[0] ;
    for ( i = 1 ; i < numVariations ; i++ )
        total *= sizes[i] ;
    if ( sizes[0] != allocSizes[0] )
    {
        if ( !( compressed = (void *)IHBDeclare( size ) ) )
            goto AddTraceback ;
        /*
        ** Set the bitset if the caller supplied us with one.
        */
        if ( bitset )
        {
            rowLength = sizes[1] ;
            for ( i = 0 ; i < total ; i++ )
            {
                index1 = i / rowLength ;
                index2 = i % rowLength ;
                byte = ( i + offset ) / 8 ;
                bit = ( i + offset ) % 8 ;
                if ( bs[byte] & setbits[bit] )
                {
                    newPos = ( index1 * allocSizes[1] ) + index2 ;
                    IHBSet(compressed, newPos );
                }
            }
            IHBOptimize(compressed) ;
        }
    }
    else
    {
        if ( bitset )
        {
            if ( !( compressed = (void *)IHBDeclareWithInit(offset, size, bs )) )
                goto AddTraceback ;
        }
        else
        {
            if ( !( compressed = (void *)IHBDeclare( size ) ) )
                goto AddTraceback ;
        }
    }
    return compressed ;
AddTraceback :
    return (void *)NULL ;
}
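When the allocated row length differs from the actual one, the routine above re-addresses each set bit from position index1 * sizes[1] + index2 to index1 * allocSizes[1] + index2. The sketch below performs the same remapping on plain byte arrays; using an uncompressed destination in place of the IHB compressed bitset is an assumption made for illustration, and the names `get_bit`, `set_bit`, and `widen_bit_matrix` are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Test bit number pos in a byte array (least significant bit first). */
static int get_bit(const unsigned char *bits, int pos)
{
    return (bits[pos / 8] >> (pos % 8)) & 1;
}

static void set_bit(unsigned char *bits, int pos)
{
    bits[pos / 8] |= (unsigned char)(1u << (pos % 8));
}

/* Copy a rows x cols bit matrix that starts at bit `offset` of src
 * into dst, whose rows are alloc_cols bits wide (alloc_cols >= cols).
 * dst must hold at least (rows * alloc_cols + 7) / 8 bytes. */
static void widen_bit_matrix(const unsigned char *src, int offset,
                             int rows, int cols,
                             unsigned char *dst, int alloc_cols)
{
    int i;

    assert(alloc_cols >= cols && cols > 0);
    memset(dst, 0, (size_t)((rows * alloc_cols + 7) / 8));
    for (i = 0; i < rows * cols; i++) {
        if (get_bit(src, i + offset)) {
            int row = i / cols;
            int col = i % cols;
            set_bit(dst, row * alloc_cols + col);   /* padded row stride */
        }
    }
}
```

Leaving spare columns in each row means new reagents can later be added to a variation site without renumbering the products already recorded.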

/*
** This routine will create a compressed bitset that is bigger in every
** dimension.
*/
void *CreateCompressedBitSetExp(bitset,offset,numVariations,sizes,allocSizes)
int *bitset ;
int offset ;
int numVariations ;
int *sizes ;
int *allocSizes ;
{
    void *compressed ;
    int byte ;
    int bit ;
    int size ;
    int total ;
    int i ;
    int index1 ;
    int index2 ;
    int newPos ;
    int rowLength ;
    unsigned char *bs = (unsigned char *)bitset ;

    Init();
    /*
    ** Always calculate the allocated sizes, we are the only one who can set this.
    */
    if ( allocSizes[0] <= 0 )
        CalculateAllocatedSizes(numVariations,sizes,allocSizes);
    size = allocSizes[0] ;
    for ( i = 1 ; i < numVariations ; i++ )
        size *= allocSizes[i] ;
    total = sizes[0] ;
    for ( i = 1 ; i < numVariations ; i++ )
        total *= sizes[i] ;
    if ( sizes[0] != allocSizes[0] )
    {
        if ( !( compressed = (void *)IHBDeclare( size ) ) )
            goto AddTraceback ;
        /*
        ** Set the bitset if the caller supplied us with one.
        */
        if ( bitset )
        {
            rowLength = sizes[1] ;
            for ( i = 0 ; i < total ; i++ )
            {
                index1 = i / rowLength ;
                index2 = i % rowLength ;
                byte = ( i + offset ) / 8 ;
                bit = ( i + offset ) % 8 ;
                if ( bs[byte] & setbits[bit] )
                {
                    newPos = ( index1 * allocSizes[1] ) + index2 ;
                    IHBSet(compressed, newPos );
                }
            }
        }
    }
    else
    {
        if ( bitset )
        {
            if ( !( compressed = (void *)IHBDeclareWithInit(offset, size, bs )) )
                goto AddTraceback ;
        }
        else
        {
            if ( !( compressed = (void *)IHBDeclare( size ) ) )
                goto AddTraceback ;
        }
    }
    IHBOptimize(compressed);
    return compressed ;
AddTraceback :
    return (void *)NULL ;
}

CountBitSets(char *inputFileName)
{
    FILE *inputFile ;
    int i ;
    char *line ;
    long length ;
    long next ;
    long here ;

    if ( !(inputFile = fopen(inputFileName, "r")) )
        goto UnableToOpenFile ;
    i = 0 ;
    while ( 1 )
    {
        here = ftell(inputFile);

        if ( -1 == UTL_SCAN_GETS( inputFile, "\n", &line ) )
            break ;

        if ( UTL_STR_NCMP_NOCASE(line, "@BITSET_LENGTH:", 15) != 0 )
        {
            /*
            ** Kludge for dbsearch old format.
            */
            if ( UTL_STR_NCMP_NOCASE(line, "@BITSET_START:", 14) != 0 )
                break ;
        }
        length = (long) atoi(line + 15) ;
        next = here + length ;
        fseek(inputFile, next, SEEK_SET);
    }
    return i ;
UnableToOpenFile :
AddTraceback :
    return 0 ;
}
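CountBitSets walks a file of consecutive records, each introduced by an "@BITSET_LENGTH:" line, by seeking from the start of the header to the start of the next record. A self-contained sketch of that traversal using only the C standard library follows. It assumes, as the seek arithmetic in the listing suggests, that the stored length is measured from the beginning of the header line; the function name `count_records` and the explicit counter increment are assumptions of this example.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TAG     "@BITSET_LENGTH:"
#define TAG_LEN 15

/* Count records in a stream where each record begins with a line
 * "@BITSET_LENGTH:<n>" and n is the record's total size in bytes,
 * header line included. Returns the number of records found. */
static int count_records(FILE *fp)
{
    char line[64];
    int count = 0;

    assert(fp != NULL);
    for (;;) {
        long here = ftell(fp);
        long length;

        if (here < 0 || !fgets(line, sizeof line, fp))
            break;
        if (strncmp(line, TAG, TAG_LEN) != 0)
            break;
        length = atol(line + TAG_LEN);
        if (length <= 0)
            break;                        /* avoid looping on a bad header */
        count++;
        if (fseek(fp, here + length, SEEK_SET) != 0)
            break;
    }
    return count;
}
```

Because each header carries the record length, the reader can skip over bitset payloads without parsing them.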
/*
**+E:
**
** Function Name : WriteOutCompressedBSFile()
** Purpose : Function will write out a compressed bitset to disk.
**
** Usage : To be used by dbsearch/design programs to write out check
**         points.
** Returns : 1 on success, 0 for failure.
**
** Algorithms : None.
**
** Revision History :
** Author            Date      Description
** Fred Soltanshahi  07/25/96  Original version.
**
*/

int
WriteOutCompressedBSFile(outputFileName,masterFileName,masterRec,programName,compressed,
                         numVariations,sizes,allocSizes,numSelected,numInSites,bufferSize,buffer)
char *outputFileName ;   /* Name of output file */
char *masterFileName ;   /* Name of master file */
int masterRec ;          /* Which record in the master file */
char *programName ;      /* Name of program generating this. */
void *compressed ;       /* Compressed bitset */
int numVariations ;      /* Num X01, X02 ... variation sites */
int *sizes ;             /* Actual sizes in each dimension */
int *allocSizes ;        /* Allocated sizes (bitset size) in each dimension */
int numSelected ;        /* Number of products selected */
int *numInSites ;        /* Number of selections in each Y_0? site */
int bufferSize ;         /* Program specific buffer size */
int buffer ;             /* Arbitrary data written by the program */
{
    FILE *outputFile ;
    float version = VERSION_NUMBER ;
    long bitsetSize ;
    time_t startTime ;
    int i ;
    int numBytes ;
    long bitsetStart = 0 ;
    long endOfFile = 0 ;
    long length ;
    long beginingOfFile = 0 ;
    char *dir ;
    char *base ;
    char *fullPathName ;
/♦ 

** Calcualte the total size of the bitset. 
*/ 

bitsetSize « sizesiO] ; 
25 for ( i *= 1 ; i < numVariations ; i++ ) 

bitsetSize = bitsetSize * sizes[i] ; 
if ( UoutputFile = fopen{outputFileName,*w"))) 

goto UnableToOpenPile ; 

    /*
    ** File format is :
    ** Version Number
    ** Date/Time Stamp
    ** "Location of the master file" "Record Number in the file".
    ** A copy of master file records(one line at a time)
    ** Number of Variation sites
    ** Actual(current) number of choices for each site.
    ** Allocated number of choices for each site.
    ** Source - program that generated this file.i.e. dbsearch, dbcsln_des,etc
    ** program command line parameters - ASCII representation of parameters
    ** Program specific info - ASCII line of info-specific to program.
    ** Number of products selected.
    ** Number of selections in each dimension - num_Y_01_choices num_Y_02_choices ...
    ** Bitset size - The compressed bitset size in bytes.
    ** Bitset - Row major bitset of products selected.
    */

    /* fprintf(outputFile,"@NEXT_SLOT:%010ld\n",endOfFile); */
    beginingOfFile = ftell(outputFile);
    fprintf(outputFile,"@BITSET_LENGTH:%010ld\n",bitsetStart);
    fprintf(outputFile,"@VERSION:%f\n",version);
    time( &startTime );
    fprintf(outputFile,"%s",ctime(&startTime));
    dir = GetFullPathName(masterFileName);
    base = basename(masterFileName,NULL);
    fullPathName = UTL_FILE_ADD_DIR_TO_DIRSPEC(dir,base);
    fprintf(outputFile,"%s %d\n",fullPathName,masterRec);
    UTL_MEM_FREE(fullPathName);
    if ( !DumpMasterFileInfo(outputFile,masterFileName,masterRec))
        goto UnableToDumpMaster ;
    fprintf(outputFile,"%d\n",numVariations);
    for ( i = 0 ; i < numVariations ; i++ )
        fprintf(outputFile,"%d ",sizes[i]);
    fprintf(outputFile,"\n");
    if ( allocSizes[0] == -1 )
        CalculateAllocatedSizes(numVariations,sizes,allocSizes);
    for ( i = 0 ; i < numVariations ; i++ )
        fprintf(outputFile,"%d ",allocSizes[i]);
    fprintf(outputFile,"\n");
    fprintf(outputFile,"%s\n",programName);
    fprintf(outputFile,"%d\n",numSelected);
    for ( i = 0 ; i < numVariations ; i++ )
        fprintf(outputFile,"%d ",numInSites[i]);
    fprintf(outputFile,"\n");

    /*
    ** Now the product bitset.
    */
    bitsetStart = ftell(outputFile);
    fprintf(outputFile,"@PRODUCT_BITSET\n");
    IHBDump(compressed,TestDump,outputFile);
    fprintf(outputFile,"@PROGRAM_INFO\n");
    fprintf(outputFile,"%d\n",bufferSize);
    if ( buffer )
        fwrite(buffer,bufferSize,1,outputFile);
    endOfFile = ftell(outputFile);
    length = endOfFile - beginingOfFile ;
    fseek(outputFile,beginingOfFile,SEEK_SET);
    fprintf(outputFile,"@BITSET_LENGTH:%010ld\n",length);

    /*
    ** Go back to the begining and write out the header info for the file
    */
    rewind(outputFile);
    /* fprintf(outputFile,"@NEXT_SLOT:%010ld\n",endOfFile); */
    fclose(outputFile) ;
    return 1 ;
UnableToDumpMaster :
    fprintf(stderr,"WriteOutCompressedBSFile()--Unable to dump master file %s\n",masterFileName);
    goto AddTraceback ;
UnableToOpenFile :
    fprintf(stderr,"WriteOutCompressedBSFile()--Unable to open %s\n",outputFileName);
    goto AddTraceback ;
UnableToWriteToFile :
    fprintf(stderr,"WriteOutCompressedBSFile()--Unable to write to output file\n");
    goto AddTraceback ;
UnableToCreateBitSet :
    fprintf(stderr,"WriteOutCompressedBSFile()--Unable to create compressed bitset\n");
    goto AddTraceback ;
AddTraceback :
    return 0 ;
}
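The routine above reserves a fixed-width `@BITSET_LENGTH:` line at the start of the record, writes the body, and then seeks back to overwrite that line once the record length is known. The following is a minimal, stand-alone sketch of that back-patching technique, not the listing's own code. The helper names `write_record_with_length` and `read_record_length` are hypothetical, and the sketch assumes a seekable stream.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: writes a record whose first line is a fixed-width
** length field, then patches that field once the real length is known.
** Returns the record length in bytes, or -1 on failure. */
static long write_record_with_length(FILE *fp, const char *payload)
{
    long start, end, length;

    assert(fp != NULL && payload != NULL);
    start = ftell(fp);
    if (start < 0)
        return -1;
    /* Placeholder: %010ld pads to 10 digits, so the line width is fixed
    ** for any non-negative length below 10^10. */
    fprintf(fp, "@BITSET_LENGTH:%010ld\n", 0L);
    fputs(payload, fp);
    end = ftell(fp);
    if (end < 0)
        return -1;
    length = end - start;
    /* Seek back and overwrite the placeholder in place. */
    if (fseek(fp, start, SEEK_SET) != 0)
        return -1;
    fprintf(fp, "@BITSET_LENGTH:%010ld\n", length);
    /* Return to the end so that later writes append after the record. */
    if (fseek(fp, end, SEEK_SET) != 0)
        return -1;
    return length;
}

/* Hypothetical helper: reads the length field back from a record header
** that begins at byte offset "start". Returns -1 on failure. */
static long read_record_length(FILE *fp, long start)
{
    char line[64];

    assert(fp != NULL);
    if (fseek(fp, start, SEEK_SET) != 0 || !fgets(line, sizeof line, fp))
        return -1;
    if (strncmp(line, "@BITSET_LENGTH:", 15) != 0)
        return -1;
    return atol(line + 15);
}
```

Because the placeholder and the final value are printed with the same fixed-width format, the overwrite does not shift any of the bytes that follow it. A reader can then skip from one record to the next by adding the stored length to the record's starting offset.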

/*
**+E:
**
** Function Name : WriteOutCheckPointFile()
**
** Purpose : Function will write out a check point(bitset) file to
**           the given file.
**
** Usage : To be used by dbsearch/design programs to write out check
**         points.
**
** Returns : 1 on success, 0 for failure.
**
** Algorithms : None.
**
** Revision History :
**
** Author             Date       Description
** Fred Soltanshahi   07/25/96   Original version.
**
*/
int
WriteOutCheckPointFile(outputFileName,masterFileName,masterRec,programName,bitSet,
    offset,numVariations,sizes,allocSizes,numSelected,numInSites,bufferSize,buffer)
char *outputFileName ; /* Name of output file */
char *masterFileName ; /* Name of master file */
int masterRec ;        /* Which record in the master file */
char *programName ;    /* Name of program generating this. */
int *bitSet ;          /* Actual bitset */
int offset ;           /* Offset into the bitset */
int numVariations ;    /* Num X01, X02 ... variation sites */
int *sizes ;           /* Actual sizes in each dimension */
int *allocSizes ;      /* Allocated sizes(bitset size) in each dimension */
int numSelected ;      /* Number of products selected */
int *numInSites ;      /* Number of selections in each Y_0? site */
int bufferSize ;       /* Program specific buffer size */
int *buffer ;          /* Arbitrary data written by the program */
{
    void *compressed ;
    /*
    ** Create a compressed bitset.
    */
    if ( !(compressed = CreateCompressedBitSet(bitSet,
                                               offset,
                                               numVariations,
                                               sizes,
                                               allocSizes) ) )
        goto UnableToCreateBitSet ;
    if ( !WriteOutCompressedBSFile(outputFileName,
                                   masterFileName,
                                   masterRec,
                                   programName,
                                   compressed,
                                   numVariations,
                                   sizes,
                                   allocSizes,
                                   numSelected,
                                   numInSites,
                                   bufferSize,
                                   buffer))
        goto UnableToWriteToFile ;
    IHBDestroy(compressed);
    return 1 ;
UnableToWriteToFile :
    fprintf(stderr,"WriteOutCheckPointFile()--Unable to write to output file\n");
    goto AddTraceback ;
UnableToCreateBitSet :
    fprintf(stderr,"WriteOutCheckPointFile()--Unable to create compressed bitset\n");
    goto AddTraceback ;
AddTraceback :
    return 0 ;
}

static int TestRestore(DumpCode, Parameter, DumpInt)
int DumpCode;
void *Parameter;
int *DumpInt;
{
    FILE *fp = (FILE *)Parameter ;
    fscanf(fp,"%d\n",DumpInt);
    return 1 ;
}

static int GetReagentInfo(char *fileName,char **reagentInfo)
{
    FILE *fp ;
    char buffer[4096] ;
    int i=0;
    char *cp ;

    if ( !(fp = fopen(fileName,"r")) )
        return 0 ;
    while ( fgets(buffer,sizeof(buffer)-1,fp))
    {
        buffer[strlen(buffer)-1] = 0 ;
        if ( ( cp = strstr(buffer,"# USER_NAME=")) )
        {
            (*reagentInfo) = UTL_STR_SAVE(buffer+12);
            return 1;
        }
        /*
        ** Only read the first 10 lines.
        */
        if ( i++ > 10 )
            break;
    }
    /*
    ** We did not find the reagent info, lets save an empty string and
    ** assume the file is old and does not contain the info.
    */
    (*reagentInfo) = UTL_STR_SAVE("");
    return 1 ;
}
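GetReagentInfo() looks for a `# USER_NAME=` tag in the first few lines of a reagent file and falls back to an empty string when the tag is absent. The sketch below illustrates the same idea in stand-alone form. `find_tag_value` and `copy_string` are hypothetical names, `malloc` stands in for the listing's `UTL_STR_SAVE`, and, unlike the listing, the value is taken from the position where the tag was found rather than from a fixed offset in the buffer.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for UTL_STR_SAVE: returns a malloc'd copy. */
static char *copy_string(const char *s)
{
    char *out = malloc(strlen(s) + 1);

    if (out)
        strcpy(out, s);
    return out;
}

/* Hypothetical helper: scans at most max_lines lines of fp for "tag" and
** returns a malloc'd copy of the text that follows it on that line.
** Returns an empty string when the tag is not found, NULL when out of
** memory. The caller frees the result. */
static char *find_tag_value(FILE *fp, const char *tag, int max_lines)
{
    char buffer[4096];
    int i;

    assert(fp != NULL && tag != NULL);
    for (i = 0; i < max_lines && fgets(buffer, sizeof buffer, fp); i++) {
        char *cp;

        /* Strip the line terminator, if any. */
        buffer[strcspn(buffer, "\r\n")] = '\0';
        cp = strstr(buffer, tag);
        if (cp)
            return copy_string(cp + strlen(tag));
    }
    /* Old files may lack the tag: report an empty value, not an error. */
    return copy_string("");
}
```

Using `strcspn` to strip the terminator avoids touching `buffer[-1]` on an empty read and also copes with a final line that has no newline.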
/*
**+I:
**
** Function Name : ReadCheckPointFile()
**
** Purpose : Function will read in a check point(bitset) file from
**           the given file.
**
** Usage : To be used by dbsearch/design programs to read in check
**         points.
**
** Returns : 1 on success, 0 for failure.
**
** Algorithms : None.
**
** Revision History :
**
** Author             Date       Description
** Fred Soltanshahi   07/25/96   Original version.
**
*/

static int
ReadCheckPointFile(inputFileName,inputOffset,masterFileName,masterRec,programName,
    bitSet,numVariations,sizes,allocSizes,numSelected,numInSites,masterInfo,bufferSize,
    progBuff)
char *inputFileName ;    /* Name of input file */
int inputOffset ;        /* Offset into the input file where the bitset starts */
char **masterFileName ;  /* Name of master file */
int *masterRec ;         /* Which record in the master file */
char **programName ;     /* Name of program generating this. */
int **bitSet ;           /* Actual bitset */
int *numVariations ;     /* Num X01, X02 ... variation sites */
int **sizes ;            /* Actual sizes in each dimension */
int **allocSizes ;       /* Allocated sizes(bitset size) in each dimension */
int *numSelected ;       /* Number of products selected */
int **numInSites ;       /* Number of selections in each Y_0? site */
struct MasterFileStruct *masterInfo ;
int *bufferSize ;        /* program specific buffer size */
int **progBuff ;         /* Program specific buffer */

{
    FILE *inputFile ;
    float version ;
    long bitsetSize ;
    time_t startTime ;
    int i ;
    int numBytes ;
    char buffer[4096] ;
    char hold[81] ;
    char *cp ;

    if ( !(inputFile = fopen(inputFileName,"r")))
        goto UnableToOpenFile ;

    /*
    ** File format is :
    ** @BITSET_START: Where In the File The bitset starts
    ** @VERSION:Version Number:currentVersion Number
    ** Date/Time Stamp
    ** "Location of the master file" "Record Number in the file".
    ** @MASTER NumberOfLines : Number of lines used for the master file.
    ** A copy of master file records(one line at a time.currently 11 lines )
    ** Number of Variation sites
    ** Actual(current) number of choices for each site.
    ** Allocated number of choices for each site.
    ** Source - program that generated this file.i.e. dbsearch, dbcsln_des,etc
    ** Number of products selected.
    ** Number of selections in each dimension - num_Y_01_choices num_Y_02_choices
    ** @PRODUCT_BITSET
    ** Bitset - Row major bitset of products selected.
    ** @PROGRAM_INFO
    ** size of program specific buffer
    ** program specific buffer
    */
    /*
    ** BITSET_START:
    */
    if ( !fgets(buffer,sizeof(buffer)-1,inputFile))
        goto UnableToReadFile ;

    /*
    ** VERSION.
    */
    if ( !fgets(buffer,sizeof(buffer)-1,inputFile))
        goto UnableToReadFile ;
    cp = strstr(buffer,"@VERSION:");
    if ( !cp )
        goto VersionMissing ;
    version = atof(buffer);
    if ( !fgets(buffer,sizeof(buffer)-1,inputFile))
        goto UnableToReadFile ;
    if ( !fgets(buffer,sizeof(buffer)-1,inputFile))
        goto UnableToReadFile ;
    sscanf(buffer,"%s %d",hold,masterRec);
    (*masterFileName) = UTL_STR_SAVE(hold);

    if ( masterInfo )
    {
        masterInfo->masterFilePathName = UTL_STR_SAVE(hold);
        masterInfo->masterRecNo = *masterRec ;

        /*
        ** @MASTER 11 thing.
        */
        if ( !fgets(buffer,sizeof(buffer)-1,inputFile))
            goto UnableToReadFile ;

        /*
        ** Reaction class
        */
        if ( !fgets(buffer,sizeof(buffer)-1,inputFile))
            goto UnableToReadFile ;

        /*
        ** Prefix
        */
        if ( !fgets(buffer,sizeof(buffer)-1,inputFile))
            goto UnableToReadFile ;
        buffer[strlen(buffer)-1] = 0 ;
        masterInfo->prefixForFiles =
            UTL_STR_SAVE(UTL_FILE_PARSE(buffer,4));

        /*
        ** number of sites
        */
        if ( !fgets(buffer,sizeof(buffer)-1,inputFile))
            goto UnableToReadFile ;
        masterInfo->numVariationSites = atoi(buffer);

        /*
        ** missing bits
        */
        if ( !fgets(buffer,sizeof(buffer)-1,inputFile))
            goto UnableToReadFile ;
        masterInfo->numberOfMissingBits = atoi(buffer);

        /*
        ** Lbits
        */
        if ( !fgets(buffer,sizeof(buffer)-1,inputFile))
            goto UnableToReadFile ;
        masterInfo->lbits = atoi(buffer);

        /*
        ** core file
        */
        if ( !fgets(buffer,sizeof(buffer)-1,inputFile))
            goto UnableToReadFile ;
        buffer[strlen(buffer)-1] = 0 ;
        masterInfo->corefilePathName = UTL_STR_SAVE(buffer);

        /*
        ** Start core
        */
        if ( !fgets(buffer,sizeof(buffer)-1,inputFile))
            goto UnableToReadFile ;
        masterInfo->startCore = atoi(buffer);

        /*
        ** finger print file name
        */
        if ( !fgets(buffer,sizeof(buffer)-1,inputFile))
            goto UnableToReadFile ;
        buffer[strlen(buffer)-1] = 0 ;
        masterInfo->fingerFileName = UTL_STR_SAVE(buffer);

        /*
        ** start finger print
        */
        if ( !fgets(buffer,sizeof(buffer)-1,inputFile))
            goto UnableToReadFile ;
        masterInfo->fingerOffset = atoi(buffer);
        if (!(masterInfo->x_FileName = (char **)UTL_MEM_CALLOC(
                                           masterInfo->numVariationSites,
                                           sizeof(char *))) )
            goto AddTraceback ;
        if (!(masterInfo->reagentInfo = (char **)UTL_MEM_CALLOC(
                                            masterInfo->numVariationSites,
                                            sizeof(char *))) )
            goto AddTraceback ;

        /*
        ** X1 file pathname
        */
        if ( !fgets(buffer,sizeof(buffer)-1,inputFile))
            goto UnableToReadFile ;
        buffer[strlen(buffer)-1] = 0 ;
        masterInfo->x_FileName[0] = UTL_STR_SAVE(buffer);
        if ( !GetReagentInfo(buffer,&masterInfo->reagentInfo[0]))
            goto UnableToReadReagentInfo0 ;

        /*
        ** X2 file pathname
        */
        if ( !fgets(buffer,sizeof(buffer)-1,inputFile))
            goto UnableToReadFile ;
        buffer[strlen(buffer)-1] = 0 ;
        masterInfo->x_FileName[1] = UTL_STR_SAVE(buffer);
        if ( !GetReagentInfo(buffer,&masterInfo->reagentInfo[1]))
            goto UnableToReadReagentInfo1 ;
    }
    else
    {
        /*
        ** just skip the 12 lines.
        */
        for ( i = 0 ; i < 12 ; i++ )
        {
            if ( !fgets(buffer,sizeof(buffer)-1,inputFile))
                goto UnableToReadFile ;
        }
    }

    if ( !fgets(buffer,sizeof(buffer)-1,inputFile))
        goto UnableToReadFile ;
    *numVariations = atoi(buffer);
    if ( !fgets(buffer,sizeof(buffer)-1,inputFile))
        goto UnableToReadFile ;

    /*
    ** We have to allocate the arrays for sizes, allocSizes and fragSizes.
    */
    if ( !( (*sizes) = (int *)UTL_MEM_CALLOC(*numVariations,
                                             sizeof(int))
          ))
        goto AddTraceback ;
    if ( !( (*allocSizes) = (int *)UTL_MEM_CALLOC(*numVariations,
                                                  sizeof(int))
          ))
        goto AddTraceback ;
    if ( !( (*numInSites) = (int *)UTL_MEM_CALLOC(*numVariations,
                                                  sizeof(int))
          ))
        goto AddTraceback ;
    cp = buffer ;
    for ( i = 0 ; i < *numVariations ; i++ )
    {
        sscanf(cp,"%d",&((*sizes)[i]));
        cp = strstr(cp," ");
    }
    if ( !fgets(buffer,sizeof(buffer)-1,inputFile))
        goto UnableToReadFile ;
    cp = buffer ;
    for ( i = 0 ; i < *numVariations ; i++ )
    {
        sscanf(cp,"%d",&((*allocSizes)[i]));
        cp = strstr(cp," ");
    }
    if ( !fgets(buffer,sizeof(buffer)-1,inputFile))
        goto UnableToReadFile ;
    buffer[strlen(buffer)-1] = 0 ;
    (*programName) = UTL_STR_SAVE(buffer);
    if ( !fgets(buffer,sizeof(buffer)-1,inputFile))
        goto UnableToReadFile ;
    *numSelected = atoi(buffer);
    if ( !fgets(buffer,sizeof(buffer)-1,inputFile))
        goto UnableToReadFile ;
    cp = buffer ;
    for ( i = 0 ; i < *numVariations ; i++ )
    {
        sscanf(cp,"%d",&((*numInSites)[i]));
        cp = strstr(cp," ");
    }
    /*
    ** @PRODUCT_BITSET
    */
    if ( !fgets(buffer,sizeof(buffer)-1,inputFile))
        goto UnableToReadFile ;
    IHBRestore(bitSet,TestRestore,inputFile);

    /*
    ** @PROGRAM_INFO
    */
    if ( !fgets(buffer,sizeof(buffer)-1,inputFile))
        goto UnableToReadFile ;
    fclose(inputFile);
    return 1 ;
VersionMissing :
    fprintf(stderr,"ReadCheckPointFile()--File %s is not a valid ChemSpace file\n",inputFileName);
    goto AddTraceback ;
UnableToReadFile :
    fprintf(stderr,"ReadCheckPointFile()--Unable to read from file\n");
    goto AddTraceback ;
UnableToDumpMaster :
    fprintf(stderr,"ReadCheckPointFile()--Unable to dump master file %s\n",masterFileName);
    goto AddTraceback ;
UnableToReadReagentInfo0 :
    fprintf(stderr,"ReadCheckPointFile()--Unable to read reagent info in %s\n",
            masterInfo->x_FileName[0]);
UnableToReadReagentInfo1 :
    fprintf(stderr,"ReadCheckPointFile()--Unable to read reagent info in %s\n",
            masterInfo->x_FileName[1]);
    goto AddTraceback ;
UnableToOpenFile :
    fprintf(stderr,"ReadCheckPointFile()--Unable to open %s\n",inputFileName);
    goto AddTraceback ;
UnableToWriteToFile :
    fprintf(stderr,"ReadCheckPointFile()--Unable to write to output file\n");
    goto AddTraceback ;
AddTraceback :
    return 0 ;
}
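ReadCheckPointFile() reads the per-site sizes from a single line of space-separated integers. As a hedged alternative to the `sscanf`/`strstr` loop used there, the sketch below walks such a line with the standard `strtol` function, which reports where each number ended. The helper name `parse_int_list` is hypothetical and not part of the listing.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: parses up to "count" whitespace-separated decimal
** integers from "line" into "out". Returns how many were actually read. */
static int parse_int_list(const char *line, int *out, int count)
{
    const char *cp = line;
    int i;

    assert(line != NULL && out != NULL);
    for (i = 0; i < count; i++) {
        char *end;
        long value = strtol(cp, &end, 10);

        /* strtol leaves end == cp when no digits could be converted. */
        if (end == cp)
            break;
        out[i] = (int)value;
        cp = end;  /* continue right after the number just read */
    }
    return i;
}
```

Advancing from the end pointer means every call consumes a new token, so the loop behaves the same for any number of variation sites, and the return value lets the caller detect a short line.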

static struct BitSetFileStruct *ReadAndAllocateMaster(char *masterFileName,int
masterRecNumber,int *initBitset)
{
    struct BitSetFileStruct *bitset ;
    FILE *mfFP ;

    /*
    ** First allocate everything we need.
    */
    if ( !(bitset = (struct BitSetFileStruct *)UTL_MEM_CALLOC(1,
                        sizeof(struct BitSetFileStruct ))))
        goto AddTraceback ;
    bitset->masterFileInfo.masterFilePathName = UTL_STR_SAVE(masterFileName);
    bitset->masterFileInfo.masterRecNo = masterRecNumber;

    /*
    ** Some of this info is fixed for now.
    */
    bitset->programInfo.programName = (char *)NULL ;
    bitset->masterFileInfo.numVariationSites = bitset->numVariationSites = 2 ;
    bitset->totalSelected = 0 ;
    if (!(bitset->masterFileInfo.x_FileName = (char **)UTL_MEM_CALLOC(
                                                  bitset->numVariationSites,
                                                  sizeof(char *))) )
        goto AddTraceback ;
    if (!(bitset->actuallSizes = (int *)UTL_MEM_CALLOC(
                                     bitset->numVariationSites,
                                     sizeof(int ))) )
        goto AddTraceback ;
    if (!(bitset->allocSizes = (int *)UTL_MEM_CALLOC(
                                   bitset->numVariationSites,
                                   sizeof(int ))) )
        goto AddTraceback ;
    if (!(bitset->numFragsInEachSite = (int *)UTL_MEM_CALLOC(
                                           bitset->numVariationSites,
                                           sizeof(int ))) )
        goto AddTraceback ;
    if ( !RetrieveMasterFile(masterFileName,
                             mfFP,
                             masterRecNumber,
                             &(bitset->masterFileInfo.numberOfMissingBits),
                             &(bitset->masterFileInfo.lbits),
                             &(bitset->masterFileInfo.corefilePathName),
                             &(bitset->masterFileInfo.fingerFileName),
                             &(bitset->masterFileInfo.x_FileName[0]),
                             &(bitset->masterFileInfo.startCore),
                             &(bitset->masterFileInfo.x_FileName[1]),
                             &(bitset->actuallSizes[0]),
                             &(bitset->actuallSizes[1]),
                             NULL,
                             &(bitset->masterFileInfo.fingerOffset),
                             NULL,
                             NULL,
                             NULL,
                             NULL,
                             NULL,
                             NULL))
        goto AddTraceback ;
    if ( !( bitset->bitset = CreateCompressedBitSet(initBitset,
                                                    0,
                                                    bitset->numVariationSites,
                                                    bitset->actuallSizes,
                                                    bitset->allocSizes)) )
        goto AddTraceback ;
    return bitset ;
AddTraceback :
    fprintf(stderr,"ReadAndAllocateMaster()--Unable to read master file\n");
    return (struct BitSetFileStruct *) NULL ;
}



static int CalculateFragsInSites( struct BitSetFileStruct *bitset )
{
    unsigned char **bitArray ;
    int i ;
    int j ;
    int size ;
    int address ;
    int *indxs = 0 ;
    int what ;
    int this ;

    Init();
    if ( !( bitArray = (unsigned char
            **)UTL_MEM_CALLOC(bitset->numVariationSites,
                              sizeof(unsigned char *))) )
        goto AddTraceback ;
    for ( i = 0 ; i < bitset->numVariationSites ; i++ )
    {
        size = ( bitset->actuallSizes[i] + 7 ) / 8 ;
        if ( !(bitArray[i] = (unsigned char
               *)UTL_MEM_CALLOC(size,sizeof(unsigned char))))
            goto AddTraceback ;
    }
    for ( address = -1 , i = 0 ; i < bitset->totalSelected ; i++ )
    {
        address = IHBFindNextOne(bitset->bitset,address+1);
        BitSetAddressToIndexes(bitset,address,&indxs,0);
        /*
        ** For every hit and every variation site address set the bit to 1.
        */
        for ( j = 0 ; j < bitset->numVariationSites ; j++ )
        {
            what = indxs[j] % 8;
            this = indxs[j] / 8;
            bitArray[j][this] |= setbits[what];
        }
    }
    for ( i = 0 ; i < bitset->numVariationSites ; i++ )
    {
        size = ( bitset->actuallSizes[i] + 7 ) / 8 ;
        bitset->numFragsInEachSite[i] = 0 ;
        for ( j = 0 ; j < size ; j++ )
            bitset->numFragsInEachSite[i] += nbits[bitArray[i][j] & 255];
    }

    /*
    ** Get rid of memory we allocated for calculating the hits the last time.
    */
    if ( bitset->fragmentBitset )
    {
        for ( i = 0 ; i < bitset->numVariationSites ; i++ )
            UTL_MEM_FREE(bitset->fragmentBitset[i]);
        UTL_MEM_FREE(bitset->fragmentBitset );
    }
    bitset->fragmentBitset = bitArray ;
    UTL_MEM_FREE(indxs);
    return 1 ;
AddTraceback :
    return 0 ;
}
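CalculateFragsInSites() records which fragments of a site are used by setting one bit per fragment in a byte array, then counts the distinct fragments by summing a per-byte bit-count table. The sketch below shows that technique in isolation. The listing does not show how its `setbits` and `nbits` tables are filled, so the contents built by the hypothetical `init_tables` (bit `i` of a byte is `1 << i`) are an assumption, as are the helper names `mark_fragment` and `count_fragments`.

```c
#include <assert.h>
#include <stddef.h>

static unsigned char setbits[8];   /* setbits[i]: mask for bit i (assumed order) */
static unsigned char nbits[256];   /* nbits[v]: number of 1 bits in byte v */

/* Hypothetical initialiser for the two lookup tables. */
static void init_tables(void)
{
    int i, v;

    for (i = 0; i < 8; i++)
        setbits[i] = (unsigned char)(1u << i);
    for (v = 0; v < 256; v++) {
        int count = 0, bits = v;

        while (bits) {
            count += bits & 1;
            bits >>= 1;
        }
        nbits[v] = (unsigned char)count;
    }
}

/* Marks fragment "index" as used; marking the same index twice is harmless. */
static void mark_fragment(unsigned char *bits, int index)
{
    assert(bits != NULL && index >= 0);
    bits[index / 8] |= setbits[index % 8];
}

/* Counts distinct marked fragments among num_fragments candidates. */
static int count_fragments(const unsigned char *bits, int num_fragments)
{
    int size = (num_fragments + 7) / 8;   /* bytes needed, rounded up */
    int j, total = 0;

    assert(bits != NULL);
    for (j = 0; j < size; j++)
        total += nbits[bits[j] & 255];
    return total;
}
```

Because each fragment owns exactly one bit, a fragment that appears in many selected products is still counted once, which is what makes the result a count of distinct fragments per site.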

static int GetPartialProductsAddresses(struct BitSetFileStruct *bitset,int numFixedSites, int
*fixedSitesIndexes, int site, int **hitIndexes)
{
    int i ;
    int j ;
    int k ;
    int address ;
    int *indxs = 0 ;
    int skipit ;
    int numHits = 0 ;
    unsigned char **bitArray ;
    int size ;
    int what ;
    int this ;

    Init();
    if ( !( bitArray = (unsigned char
            **)UTL_MEM_CALLOC(bitset->numVariationSites,
                              sizeof(unsigned char *))) )
        goto AddTraceback ;
    for ( i = 0 ; i < bitset->numVariationSites ; i++ )
    {
        /*
        ** We only want to count the fragments for the site that is being exploded.
        */
        if ( (site != -1 ) && (i != site) )
            continue ;
        size = ( bitset->actuallSizes[i] + 7 ) / 8 ;
        if ( !(bitArray[i] = (unsigned char
               *)UTL_MEM_CALLOC(size,sizeof(unsigned char))))
            goto AddTraceback ;
    }
    for ( address = -1 , i = 0 ; i < bitset->totalSelected ; i++ )
    {
        address = IHBFindNextOne(bitset->bitset,address+1);
        BitSetAddressToIndexes(bitset,address,&indxs,0);

        /*
        ** The sites that have already been expanded will constrain what hits
        ** we find.
        */
        if ( numFixedSites )
        {
            skipit = 0 ;
            for ( k = 0 ; k < bitset->numVariationSites ; k++ )
            {
                if ( fixedSitesIndexes[k] == -1 )
                    continue ;

                /*
                ** our hit index matches our constraint index.
                */
                if ( fixedSitesIndexes[k] != indxs[k] )
                {
                    skipit = 1 ;
                    break;
                }
            }
            if ( skipit )
                continue ;
        }
        numHits++ ;
        for ( j = 0 ; j < bitset->numVariationSites ; j++ )
        {
            if ( (site != -1 ) && (j != site) )
                continue ;
            what = indxs[j] % 8;
            this = indxs[j] / 8;
            bitArray[j][this] |= setbits[what];
        }
    }

#if 0
    /*
    ** Figure out how many hits there are for this site.
    */
    for ( k = 0 ; k < bitset->numVariationSites ; k++ )
    {
        if ( site != k )
            continue ;
        size = ( bitset->actuallSizes[k] + 7 ) / 8 ;
        numFragmentsPerSite[k] = 0 ;
        for ( j = 0 ; j < size ; j++ )
            numFragmentsPerSite[k] += nbits[bitArray[k][j] & 255];
    }
#endif
    /*
    ** Now get the indexes for all the hits.
    */
    if ( !( (*hitIndexes) = (int *)UTL_MEM_CALLOC(numHits,
                                                  sizeof(int))
          ))
        goto AddTraceback ;
    numHits = 0 ;
    for ( k = 0 ; k < bitset->numVariationSites ; k++ )
    {
        if ( site == -1 )
        {
            if ( fixedSitesIndexes[k] != -1 )
                continue ;
        }
        else
        {
            if ( site != k )
                continue ;
        }
        size = ( bitset->actuallSizes[k] + 7 ) / 8 ;
        for ( i = 0 ; i < size ; i++ )
        {
            if ( bitArray[k][i] )
            {
                /*
                ** If any bit is set in the byte then we need to figure out what the hits are.
                */
                for ( j = 0 ; j < 8 ; j++ )
                    if ( bitArray[k][i] & setbits[j] )
                    {
                        (*hitIndexes)[numHits++] = i * 8 + j ;
                    }
            }
        }
    }
    for ( i = 0 ; i < bitset->numVariationSites ; i++ )
        if ( bitArray[i] )
            UTL_MEM_FREE(bitArray[i]);
    UTL_MEM_FREE(bitArray );
    return numHits ;
AddTraceback :
    return 0 ;
}



static int GetPartialProductsStats( struct BitSetFileStruct *bitset , int numFixedSites, int
*fixedSitesIndexes, int *numProducts, int *numFragmentsPerSite)
{
    int i ;
    int j ;
    int k ;
    int address ;
    int *indxs = 0 ;
    int skipit ;
    int numHits = 0 ;
    unsigned char **bitArray ;
    int size ;
    int what ;
    int this ;

    Init();
    if ( !( bitArray = (unsigned char
            **)UTL_MEM_CALLOC(bitset->numVariationSites,
                              sizeof(unsigned char *))) )
        goto AddTraceback ;
    for ( i = 0 ; i < bitset->numVariationSites ; i++ )
    {
        /*
        ** We only want to count the fragments for the sites that are not being
        ** exploded.
        */
        if ( fixedSitesIndexes[i] != -1 )
            continue ;
        size = ( bitset->actuallSizes[i] + 7 ) / 8 ;
        if ( !(bitArray[i] = (unsigned char
               *)UTL_MEM_CALLOC(size,sizeof(unsigned char))))
            goto AddTraceback ;
    }
    for ( address = -1 , i = 0 ; i < bitset->totalSelected ; i++ )
    {
        address = IHBFindNextOne(bitset->bitset,address+1);
        BitSetAddressToIndexes(bitset,address,&indxs,0);
        /*
        ** The sites that have already been expanded will constrain what hits
        ** we find.
        */
        if ( numFixedSites )
        {
            skipit = 0 ;
            for ( k = 0 ; k < bitset->numVariationSites ; k++ )
            {
                if ( fixedSitesIndexes[k] == -1 )
                    continue ;

                /*
                ** our hit index matches our constraint index.
                */
                if ( fixedSitesIndexes[k] != indxs[k] )
                {
                    skipit = 1 ;
                    break;
                }
            }
            if ( skipit )
                continue ;
        }
        numHits++ ;
        for ( j = 0 ; j < bitset->numVariationSites ; j++ )
        {
            if ( fixedSitesIndexes[j] != -1 )
                continue ;
            what = indxs[j] % 8;
            this = indxs[j] / 8;
            bitArray[j][this] |= setbits[what];
        }
    }
    for ( k = 0 ; k < bitset->numVariationSites ; k++ )
    {
        if ( fixedSitesIndexes[k] != -1 )
            continue ;
        size = ( bitset->actuallSizes[k] + 7 ) / 8 ;
        numFragmentsPerSite[k] = 0 ;
        for ( j = 0 ; j < size ; j++ )
            numFragmentsPerSite[k] += nbits[bitArray[k][j] & 255];
    }
    *numProducts = numHits ;
    for ( i = 0 ; i < bitset->numVariationSites ; i++ )
        if ( bitArray[i] )
            UTL_MEM_FREE(bitArray[i]);
    UTL_MEM_FREE(bitArray );
    return numHits ;
AddTraceback :
    return 0 ;
}
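The partial-product routines filter hits with a constraint array in which `-1` means "this site is free" and any other value pins the site to one fragment index. A minimal sketch of that matching rule follows. The names `matches_constraints` and `count_matching_hits` are hypothetical, and the hits are passed as a flat array purely for illustration, whereas the listing decodes them one at a time from the compressed bitset.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 when "indexes" satisfies every pinned site in "fixed".
** A value of -1 in "fixed" leaves that site unconstrained. */
static int matches_constraints(const int *fixed, const int *indexes,
                               int num_sites)
{
    int k;

    assert(fixed != NULL && indexes != NULL);
    for (k = 0; k < num_sites; k++) {
        if (fixed[k] == -1)
            continue;               /* free site: anything matches */
        if (fixed[k] != indexes[k])
            return 0;               /* pinned site disagrees: reject */
    }
    return 1;
}

/* Counts the hits that satisfy the constraints. "hits" holds num_hits
** consecutive index tuples of num_sites integers each (illustrative layout). */
static int count_matching_hits(const int *fixed, const int *hits,
                               int num_hits, int num_sites)
{
    int h, total = 0;

    assert(hits != NULL);
    for (h = 0; h < num_hits; h++)
        total += matches_constraints(fixed, hits + (size_t)h * num_sites,
                                     num_sites);
    return total;
}
```

With every entry set to `-1` the predicate accepts all hits, which corresponds to the listing skipping the check entirely when `numFixedSites` is zero.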

static int GetPartialProducts( struct BitSetFileStruct *bitset , int numFixedSites, int
*fixedSitesIndexes, int whichSite, int **siteIndexes)
{
    int i ;
    int j ;
    int k ;
    int address ;
    int *indxs = 0 ;
    int skipit ;
    int numHits = 0 ;

    Init();
    for ( address = -1 , i = 0 ; i < bitset->totalSelected ; i++ )
    {
        address = IHBFindNextOne(bitset->bitset,address+1);
        BitSetAddressToIndexes(bitset,address,&indxs,0);

        /*
        ** The sites that have already been expanded will constrain what hits
        ** we find.
        */
        if ( numFixedSites )
        {
            skipit = 0 ;
            for ( k = 0 ; k < bitset->numVariationSites ; k++ )
            {
                if ( fixedSitesIndexes[k] == -1 )
                    continue ;

                /*
                ** our hit index matches our constraint index.
                */
                if ( fixedSitesIndexes[k] != indxs[k] )
                {
                    skipit = 1 ;
                    break;
                }
            }
            if ( skipit )
                continue ;
        }
        numHits++ ;
        fprintf(stderr,"Got a hit on %d %d %d\n",address,indxs[0],indxs[1]);
    }
    return numHits ;
AddTraceback :
    return 0 ;
}
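The routines above rely on BitSetAddressToIndexes() to turn a linear bitset address into one index per variation site. That function is not shown in this part of the listing, so the following is only a sketch of how a row-major address could be decomposed, assuming the last site varies fastest and that the per-site extents are the ones the bitset was laid out with. The names `address_to_indexes` and `indexes_to_address` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Splits a row-major linear address into one index per site.
** sizes[k] is the extent of site k; the last site varies fastest. */
static void address_to_indexes(long address, const int *sizes, int num_sites,
                               int *indexes)
{
    int k;

    assert(sizes != NULL && indexes != NULL && address >= 0);
    for (k = num_sites - 1; k >= 0; k--) {
        indexes[k] = (int)(address % sizes[k]);
        address /= sizes[k];
    }
}

/* Inverse mapping: rebuilds the linear address from the per-site indexes. */
static long indexes_to_address(const int *indexes, const int *sizes,
                               int num_sites)
{
    long address = 0;
    int k;

    assert(sizes != NULL && indexes != NULL);
    for (k = 0; k < num_sites; k++)
        address = address * sizes[k] + indexes[k];
    return address;
}
```

For two sites with extents 3 and 4, address 7 decomposes to indexes (1, 3), since 7 = 1 * 4 + 3, and the inverse mapping returns 7 again.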

static int GetFragmentsUsedInASite( struct BitSetFileStruct *bitset , int whichSite , int
**indxs)
{
    unsigned char *bitArray ;
    int i ;
    int j ;
    int size ;
    int *address ;
    int numHits = 0 ;
    int bit ;

    if ( !(address = (int
           *)UTL_MEM_CALLOC(bitset->numFragsInEachSite[whichSite],
                            sizeof(int))) )
        goto AddTraceback ;

    /*
    ** Figure out how many bytes there are in this bitset.
    */
    size = ( bitset->actuallSizes[whichSite] + 7 ) / 8 ;
    for ( bitArray = bitset->fragmentBitset[whichSite], i = 0 ; i < size ; i++ )
    {
        if ( bitArray[i] )
        {
            /*
            ** If any bit is set in the byte then we need to figure out what the hits are.
            */
            for ( j = 0 ; j < 8 ; j++ )
            {
                if ( bitArray[i] & setbits[j] )
                {
                    address[numHits++] = i * 8 + j ;
                }
            }
        }
    }
    (*indxs) = address ;
    return numHits ;
AddTraceback :
    return 0 ;
}

static struct BitSetFileStruct *ReadAndAllocate(char *fileName ,int offset )
{
    struct BitSetFileStruct *bitset ;

    if ( !(bitset = (struct BitSetFileStruct *)UTL_MEM_CALLOC(1,
                        sizeof(struct BitSetFileStruct ))))
        goto AddTraceback ;
    if ( !ReadCheckPointFile(fileName,
                             offset,
                             &(bitset->masterFileInfo.masterFilePathName),
                             &(bitset->masterFileInfo.masterRecNo),
                             &(bitset->programInfo.programName),
                             &(bitset->bitset),
                             &(bitset->numVariationSites),
                             &(bitset->actuallSizes),
                             &(bitset->allocSizes),
                             &(bitset->totalSelected),
                             &(bitset->numFragsInEachSite),
                             &(bitset->masterFileInfo),
                             &(bitset->programInfo.bufferSize),
                             NULL))
        goto AddTraceback ;
    return bitset ;
AddTraceback :
    return ( struct BitSetFileStruct *)NULL ;
}

static int ReadBitsetCoreInfo(void *bs, char **masterFileName, int *masterRecno, char
**core, char **xrString, int *numSites, char ***xFileNames )
{
    struct BitSetFileStruct *bitset = (struct BitSetFileStruct *) bs;
    int recNo ;
    FILE *fp ;
    int i ;
    int found = 0 ;
    char *line ;
    char *cp ;
    char *cp1 ;

    *numSites = bitset->numVariationSites ;
    *masterFileName = bitset->masterFileInfo.masterFilePathName;
    *masterRecno = bitset->masterFileInfo.masterRecNo;
    if ( !((*xFileNames) = (char **)UTL_MEM_CALLOC(*numSites,
                                                   sizeof(char *)) ))
        goto AddTraceback ;
    for ( i = 0 ; i < *numSites ; i++ )
        (*xFileNames)[i] =
            UTL_STR_SAVE(bitset->masterFileInfo.x_FileName[i]);

    /*
    ** Open the core file and read in the core and parse out the XRstring.
    */
    if ( !(fp = fopen(bitset->masterFileInfo.corefilePathName,"r")) )
        goto UnableToReadCore ;
    recNo = 0 ;
    found = 0 ;
    while ( -1 != UTL_SCAN_GETS( fp, "\n", &line))
    {
        recNo++ ;
        if ( recNo == bitset->masterFileInfo.startCore )
        {
            found = 1 ;
            break;
        }
    }
    if ( !found )
        goto UnableToReadCore ;

    /*
    ** Replace all occurrences of Y_0x with Xx.
    */
    (*core ) = UTL_STR_SAVE(line);
    cp = strstr(line,"XRLIST=");
    if ( !cp )
        (*xrString) = UTL_STR_SAVE("");
    else
    {
        /*
        ** Skip the first double quote.
        */
        cp += 8 ;

        /*
        ** Go find the end of double quotes.
        */
        cp1 = cp ;
        while ( (*cp) != '"' )
            cp++ ;
        *cp = 0;
        (*xrString) = UTL_STR_SAVE(cp1);
    }
    fclose(fp);
    return 1 ;
UnableToReadCore :
    fprintf(stderr,"ReadBitsetCoreInfo() - Unable to read core %s %d\n",
            bitset->masterFileInfo.corefilePathName,
            bitset->masterFileInfo.startCore);
AddTraceback :
    fprintf(stderr,"ReadBitsetCoreInfo() - Unable to read core info\n");
    return 0 ;
}

static int ReadMasterCoreInfo(char *masterFile, int index, char **core, char **xrString,
int *numSites, char ***xFileNames )
{
    int recNo ;
    FILE *fp;
    int i ;
    int found = 0 ;
    char *line ;
    char *cp ;
    char *cp1 ;
    char *prefix = (char *)NULL ;
    char *coreFile = (char *)NULL ;
    char *fpFileName = (char *)NULL ;
    int fpOffset ;
    int mBits ;
    int lBits ;
    int startCore ;

    *numSites = 2 ; /* fixed for now */
    if ( !((*xFileNames) = (char **)UTL_MEM_CALLOC(*numSites,
                                                   sizeof(char *)) ))
        goto AddTraceback ;

    /*
    ** Get the master file info.
    */
    if ( !GetMasterRecordHeader(masterFile,
                                index,
                                prefix,
                                &mBits,
                                &lBits,
                                &coreFile,
                                &startCore,
                                &(*xFileNames)[0],
                                &(*xFileNames)[1],
                                numSites,
                                &fpFileName,
                                &fpOffset))
        goto AddTraceback ;

    /*
    ** Open the core file and read in the core and parse out the XRstring.
    */
    if ( !(fp = fopen(coreFile,"r")) )
        goto UnableToReadCore ;
    recNo = 0 ;
    found = 0 ;
    while ( -1 != UTL_SCAN_GETS( fp, "\n", &line))
    {
        recNo++ ;
        if ( recNo == startCore )
        {
            found = 1 ;
            break;
        }
    }
    if ( !found )
        goto UnableToReadCore ;
    (*core ) = UTL_STR_SAVE(line);
    cp = strstr(line,"XRLIST=");
    if ( !cp )
        (*xrString) = UTL_STR_SAVE("");
    else
    {
        /*
        ** Skip the first double quote.
        */
        cp += 8 ;

        /*
        ** Go find the end of double quotes.
        */
        cp1 = cp ;
        while ( (*cp) != '"' )
            cp++ ;
        *cp = 0;
        (*xrString) = UTL_STR_SAVE(cp1);
    }
    fclose(fp);
    if ( coreFile )
        UTL_MEM_FREE(coreFile);
    if ( fpFileName )
        UTL_MEM_FREE(fpFileName);
    if ( prefix )
        UTL_MEM_FREE(prefix);
    return 1 ;
UnableToReadCore :
    fprintf(stderr,"ReadMastersetCoreInfo() - Unable to read core %s %d\n",
            coreFile,startCore);
AddTraceback :
    fprintf(stderr,"ReadMastersetCoreInfo() - Unable to read core info\n");
    if ( coreFile )
        UTL_MEM_FREE(coreFile);
    if ( fpFileName )
        UTL_MEM_FREE(fpFileName);
    if ( prefix )
        UTL_MEM_FREE(prefix);
    return 0 ;
}
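Both core-info readers pull the `XRLIST="..."` value out of the core record by skipping the key plus the opening quote and scanning to the closing quote. The sketch below shows the same extraction as a bounded, non-destructive helper. `extract_quoted_value` and `copy_span` are hypothetical names, `malloc` stands in for `UTL_STR_SAVE`, and treating an unterminated quote as "take the rest of the line" is an assumption of this sketch rather than behaviour taken from the listing.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a malloc'd copy of len bytes from s. */
static char *copy_span(const char *s, size_t len)
{
    char *out = malloc(len + 1);

    if (out) {
        memcpy(out, s, len);
        out[len] = '\0';
    }
    return out;
}

/* Hypothetical helper: finds key immediately followed by a double quote in
** line and returns a malloc'd copy of the quoted text. Returns an empty
** string when the key or its opening quote is absent. Caller frees. */
static char *extract_quoted_value(const char *line, const char *key)
{
    const char *cp, *end;

    assert(line != NULL && key != NULL);
    cp = strstr(line, key);
    if (!cp)
        return copy_span("", 0);
    cp += strlen(key);
    if (*cp != '"')
        return copy_span("", 0);
    cp++;                           /* skip the opening quote */
    end = strchr(cp, '"');
    if (!end)
        end = cp + strlen(cp);      /* unterminated: take the remainder */
    return copy_span(cp, (size_t)(end - cp));
}
```

Stopping at either the closing quote or the end of the string keeps the scan inside the buffer even for a malformed record, and copying the span leaves the input line unmodified.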

static void DeallocateBitset( struct BitSetFileStruct *bitset )
{
    int i ;

    if ( bitset->masterFileInfo.masterFilePathName )
        UTL_MEM_FREE(bitset->masterFileInfo.masterFilePathName);
    if ( bitset->masterFileInfo.corefilePathName )
        UTL_MEM_FREE(bitset->masterFileInfo.corefilePathName);
    if ( bitset->masterFileInfo.fingerFileName )
        UTL_MEM_FREE(bitset->masterFileInfo.fingerFileName);
    if ( bitset->masterFileInfo.prefixForFiles )
        UTL_MEM_FREE(bitset->masterFileInfo.prefixForFiles);
    for ( i = 0 ; i < bitset->masterFileInfo.numVariationSites ; i++ )
        UTL_MEM_FREE(bitset->masterFileInfo.x_FileName[i]);
    if ( bitset->masterFileInfo.x_FileName )
        UTL_MEM_FREE(bitset->masterFileInfo.x_FileName);
    if ( bitset->programInfo.programName )
        UTL_MEM_FREE(bitset->programInfo.programName);
    if ( bitset->programInfo.buffer )
        UTL_MEM_FREE(bitset->programInfo.buffer);
    IHBDestroy(bitset->bitset);
    if ( bitset->actuallSizes )
        UTL_MEM_FREE(bitset->actuallSizes);
    if ( bitset->allocSizes )
        UTL_MEM_FREE(bitset->allocSizes);
    if ( bitset->numFragsInEachSite )
        UTL_MEM_FREE(bitset->numFragsInEachSite);
    UTL_MEM_FREE(bitset);
    bitset = (struct BitSetFileStruct *) NULL ;
}

void CS_PRDCT_BITSET_DUMP( struct BitSetFileStruct *bitset )
{
    int i ;
    int indx ;
    int indx1 ;
    int indx2 ;

    fprintf(stderr,"Master file name : %s\n",bitset->masterFileInfo.masterFilePathName);
    fprintf(stderr,"Master file rec : %d\n",bitset->masterFileInfo.masterRecNo);
    fprintf(stderr,"Program Name : %s\n",bitset->programInfo.programName);
    fprintf(stderr,"Number of Sites : %d\n",bitset->numVariationSites);
    fprintf(stderr,"Number Selected : %d\n",bitset->totalSelected);
    fprintf(stderr,"Actuall Sizes : ");
    for ( i = 0 ; i < bitset->numVariationSites ; i++ )
        fprintf(stderr,"%d ",bitset->actuallSizes[i]);
    fprintf(stderr,"\n");
    fprintf(stderr,"Alloc Sizes : ");
    for ( i = 0 ; i < bitset->numVariationSites ; i++ )
        fprintf(stderr,"%d ",bitset->allocSizes[i]);
    fprintf(stderr,"\n");
    fprintf(stderr,"Num Frags in X? : ");
    /*
    ** If the number of fragments is zero then we will write -1 to tell others
    ** to calculate this themselves.
    */
    for ( i = 0 ; i < bitset->numVariationSites ; i++ )
        fprintf(stderr,"%d ",(bitset->numFragsInEachSite[i] == 0 )?-1:
                bitset->numFragsInEachSite[i]);
    fprintf(stderr,"\n");
    fprintf(stderr,"Selections : \n");
    indx = -1 ;
    do
    {
        indx = IHBFindNextOne(bitset->bitset,indx+1);
        if ( indx == -1 )
            break;
        indx1 = indx / bitset->allocSizes[1] ;
        indx2 = indx % bitset->allocSizes[1] ;
        fprintf(stderr,"%d %d\n",indx1 + 1 ,indx2 + 1 );
    } while ( 1 );
}

void CS_PRDCT_BITSET_GET_HITS( struct BitSetFileStruct *bitset , int **indexes)
{
    int i ;
    int indx ;
    int indx1 ;
    int indx2 ;
    int hitNo = 0 ;

    indx = -1 ;
    do
    {
        indx = IHBFindNextOne(bitset->bitset,indx+1);
        if ( indx == -1 )
            break;
        indx1 = indx / bitset->allocSizes[1] ;
        indx2 = indx % bitset->allocSizes[1] ;
        indexes[0][hitNo] = indx1 + 1 ;
        indexes[1][hitNo] = indx2 + 1 ;
        hitNo++ ;
    } while ( 1 );
}



/*
**+E:
**
** Function Name : CS_PRDCT_BITSET_OPEN()
**
** Purpose : Function will read in the header for a CS product bitset.
**
** Usage :
**
** Returns : A handle to the product bitset info structure or NULL on
**           error.
**
** Algorithms : None.
**
** Revision History :
**
** Author            Date      Description
** Fred Soltanshahi  07/26/96  Original version.
**
**-E:
*/

void *CS_PRDCT_BITSET_OPEN( char *bitsetFileName , int offset )
{
    struct BitSetFileStruct *bitset ;

    if ( !(bitset = ReadAndAllocate(bitsetFileName,offset)) )
        return (void *)NULL ;
    bitset->totalSelected = IHBCountOnes(bitset->bitset,
                                         0, IHBBitSize(bitset->bitset));
    /*
    ** If the program did not keep track of and output this to the file then we
    ** need to calculate it ourselves.
    */
    if ( ( bitset->numFragsInEachSite[0] == 0 ) || ( bitset->numFragsInEachSite[0]
           == -1 ) )
    {
        CalculateFragsInSites(bitset);
    }
    return (void *)bitset ;
}

/*
**+E:
**
** Function Name : CS_PRDCT_BITSET_CLOSE()
**
** Purpose : Function will close a bitset file and cleanup allocated
**           memory.
**
** Usage :
**
** Returns : None.
**
** Algorithms : None.
**
** Revision History :
**
** Author            Date      Description
** Fred Soltanshahi  07/26/96  Original version.
**
**-E:
*/

void CS_PRDCT_BITSET_CLOSE( struct BitSetFileStruct *bitset )
{
    DeallocateBitset(bitset);
}

/*
**+E:
**
** Function Name : CS_PRDCT_BITSET_WRITE()
**
** Purpose : Function will write a bitset into the given file.
**
** Usage :
**
** Returns : 1 on success or 0 on failure.
**
** Algorithms : None.
**
** Revision History :
**
** Author            Date      Description
** Fred Soltanshahi  08/02/96  Original version.
**
*/

int CS_PRDCT_BITSET_WRITE(char *fileName,char *programName,struct
BitSetFileStruct *productBitset,int progBufferSize,int *progBuffer)
{
    if ( !WriteOutCompressedBSFile(fileName,
                                   productBitset->masterFileInfo.masterFilePathName,
                                   productBitset->masterFileInfo.masterRecNo,
                                   programName,
                                   productBitset->bitset,
                                   productBitset->numVariationSites,
                                   productBitset->actuallSizes,
                                   productBitset->allocSizes,
                                   productBitset->totalSelected,
                                   productBitset->numFragsInEachSite,
                                   progBufferSize,
                                   progBuffer))
        goto AddTraceback ;
    return 1 ;
AddTraceback :
    fprintf(stderr,"CS_PRDCT_BITSET_WRITE()--Unable to write bitset file\n");
    return 0 ;
}



/*
**+E:
**
** Function Name : CS_PRDCT_BITSET_CREATE()
**
** Purpose : Function will create an in-memory product bitset from a
**           master file.
**
** Usage :
**
** Returns : A handle to the product bitset info structure or NULL on
**           error.
**
** Algorithms : None.
**
** Revision History :
**
** Author            Date      Description
** Fred Soltanshahi  08/02/96  Original version.
**
**-E:
*/

void *CS_PRDCT_BITSET_CREATE(char *masterFileName,
                             int masterRecNumber,
                             int *initRawBitset)
{
    struct BitSetFileStruct *bitset ;

    if ( !(bitset = ReadAndAllocateMaster(masterFileName,
                                          masterRecNumber,
                                          initRawBitset)) )
        return (void *)NULL ;
    else
        return (void *)bitset ;
}

/*
**+E:
**
** Function Name : CS_PRDCT_BITSET_SETBITS()
**


** Purpose : Function will copy a raw bitset into the ChemSpace product
**           bitset format.
**
** Usage :
**
** Returns : 1 on success or zero on failure.
**
** Algorithms : None.
**
** Revision History :
**
** Author            Date      Description
** Fred Soltanshahi  08/02/96  Original version.
**
*/

int CS_PRDCT_BITSET_SETBITS(void *bs, int *rawBS, int numProducts)
{
    struct BitSetFileStruct *bitset = (struct BitSetFileStruct *)bs ;
    void *compressed ;
    static int firstTime = 1 ;
    int i ;
    int total ;
    char *cp = (char *)rawBS ;
    int rowLength ;
    int index1 ;
    int index2 ;
    int byte ;
    int bit ;
    int totalSelected = 0 ;

    if ( firstTime )
    {
        Init();
        firstTime = 0 ;
    }
    /*
    ** Just create a new one.
    */
    if ( bitset->bitset )
        IHBDestroy(bitset->bitset);
    if ( !(bitset->bitset = CreateCompressedBitSet(rawBS,
                                                   0,
                                                   bitset->numVariationSites,
                                                   bitset->actuallSizes,
                                                   bitset->allocSizes) ) )
        goto UnableToCreateBitSet ;
    total = bitset->actuallSizes[0] ;
    for ( i = 1 ; i < bitset->numVariationSites ; i++ )
        total *= bitset->actuallSizes[i] ;
    if ( numProducts == -1 )
    /*
    ** Calculate what products are being set.
    */
    {
        numProducts = 0 ;
        rowLength = bitset->actuallSizes[1] ;
        for ( i = 0 ; i < total ; i++ )
        {
            byte = ( i ) / 8 ;
            bit = ( i ) % 8 ;
            if ( cp[byte] & setbits[bit] )
                numProducts++ ;
        }
    }
    bitset->totalSelected = numProducts ;
    return 1 ;
UnableToCreateBitSet :
    fprintf(stderr,"CS_PRDCT_BITSET_SETBITS() - Unable to set bit\n");
    return 0 ;
}
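
When `numProducts` is passed as -1, the function above counts the selected products by testing each bit of the raw bitset against a mask table. The sketch below shows that counting loop in isolation. The listing does not show the definition of `setbits[]`, so the bit order used here (bit `i` stored in byte `i / 8` under mask `1 << (i % 8)`) is an assumption, and `count_set_bits` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: counts the 1 bits among the first `total_bits`
 * bits of a raw, byte-packed bitset.
 * Assumption: bit i lives in byte i / 8 under mask (1 << (i % 8)).
 * The listing's own mask table (setbits[]) is not shown, so its bit
 * order may differ from this one. */
static int count_set_bits(const unsigned char *raw, size_t total_bits)
{
    size_t i;
    int count = 0;

    assert(raw != NULL);

    for (i = 0; i < total_bits; i++) {
        if (raw[i / 8] & (1u << (i % 8)))
            count++;
    }
    return count;
}
```

Bits beyond `total_bits` in the final byte are ignored, which matters when the product count is not a multiple of 8.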

/*
**+E:
**
** Function Name : CS_PRDCT_BITSET_TO_RAW ()
**
** Purpose : Function will copy a ChemSpace product bitset to a
**           raw bitset format.
**
** Usage : calloc rawBS before call. useAlloc nonzero to use allocated
**         rather than actual dimensions
**
** Returns : 1 on success or zero on failure.
**
** Algorithms : None.
**
** Revision History :
**
** Author            Date      Description
** David Patterson   09/09/96  Original version.
**
*/

int CS_PRDCT_BITSET_TO_RAW (void *bs, int *rawBS, int useAlloc)
{
    CS_PRDCT_BITSET_CONCAT_RAW(bs, rawBS, 0, useAlloc);
    return 1;
}

int CS_PRDCT_BITSET_CONCAT_RAW(void *bs, int *rawBS, int offset,
                               int useAlloc)
{
    int *indxs = 0;
    int address, sum, b;
    struct BitSetFileStruct *bitset = (struct BitSetFileStruct *) bs;

    for ( address = -1 , b = 0 ; b < bitset->totalSelected ; b++ )
    {
        address = IHBFindNextOne(bitset->bitset,address+1);
        BitSetAddressToIndexes(bitset,address,&indxs,0);
        if (useAlloc)
            FlagProduct(rawBS, 0, 0, address+offset);
        else /* must explicitly calculate the address */
        {
            sum = CS_PRDCT_BITSET_INDEXES_TO_INDEX( bitset, indxs) ;
            FlagProduct(rawBS, 0, 0, sum+offset);
        }
    }
    UTL_MEM_FREE(indxs);
    return 1;
}
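
The `useAlloc` switch above distinguishes two address layouts: one laid out with the allocated (possibly padded) site sizes and one with the actual site sizes. A minimal sketch of the translation between them follows. It assumes row-major ordering with the last site varying fastest, which is how the index routines in this listing compute addresses. The function name `alloc_to_actual_address` and the 16-site limit are hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: converts a bit address computed with the allocated
 * site sizes into the address the same product has when the actual site
 * sizes are used. Returns -1 if the address falls in padding, i.e. some
 * per-site index is >= the actual size of that site.
 * Assumption: row-major layout, last site varies fastest. */
static int alloc_to_actual_address(const int *alloc_sizes,
                                   const int *actual_sizes,
                                   int num_sites, int address)
{
    int idx[16];
    int i;
    int out = 0;

    assert(num_sites > 0 && num_sites <= 16);

    /* Peel per-site indexes off the allocated-layout address */
    for (i = num_sites - 1; i >= 0; i--) {
        idx[i] = address % alloc_sizes[i];
        address /= alloc_sizes[i];
        if (idx[i] >= actual_sizes[i])
            return -1;              /* address points into padding */
    }

    /* Re-flatten the same indexes using the actual sizes */
    for (i = 0; i < num_sites; i++)
        out = out * actual_sizes[i] + idx[i];
    return out;
}
```

For example, with allocated sizes {4, 8} and actual sizes {3, 5}, allocated address 10 corresponds to indexes (1, 2) and therefore to actual address 1 * 5 + 2 = 7.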

/*
**+E:
**
** Function Name : CS_PRDCT_BITSET_SELECTED ()
**
** Purpose : Function will return a ChemSpace bitset's totalSelected
**
** Usage :
**
** Returns : integer count of selected bits in bitset
**
** Algorithms : None.
**
** Revision History :
**
** Author            Date      Description
** David Patterson   09/24/96  Original version.
**
**-E:
*/

int CS_PRDCT_BITSET_SELECTED (void *bsvoid )
{
    struct BitSetFileStruct *bs = (struct BitSetFileStruct *) bsvoid;
    return bs->totalSelected;
}

/*
**+E:
**
** Function Name : CS_PRDCT_BITSET_REVEAL ()
**
** Purpose : Function will return a ChemSpace bitset's struct info to
**           external calling program.
**
** Usage :
**
** Returns : 1 on success or zero on failure.
**
** Algorithms : None.
**
** Revision History :
**
** Author            Date      Description
** David Patterson   09/10/96  Original version.
**
**-E:
*/

int CS_PRDCT_BITSET_REVEAL (void *bsvoid,
                            char **MasterFile_Bitset,
                            int *StartRec_Bitset,
                            int *BitsInAbsentia,
                            int *BitsInAbsentiaNoCount,
                            char **CoreFile,
                            int *StartCore,
                            char **FngrFile,
                            char ***Xfiles,
                            int **nY,
                            FILE **FngrFile_File,
                            int *FingerOff,
                            char **ScreenFileName,
                            int *BytesPerFingerPrint,
                            int *WordsPerFingerprint,
                            int **query,
                            int **FingerCore_FP,
                            int *FingerCore_Card )
{
    int i, size;
    int *fooi;
    struct BitSetFileStruct *bs = (struct BitSetFileStruct *) bsvoid;

    if (MasterFile_Bitset)
        *MasterFile_Bitset = bs->masterFileInfo.masterFilePathName ;
    if (StartRec_Bitset)
        *StartRec_Bitset = bs->masterFileInfo.masterRecNo ;
    if (BitsInAbsentia)
        *BitsInAbsentia = bs->masterFileInfo.numberOfMissingBits;
    if (BitsInAbsentiaNoCount)
        *BitsInAbsentiaNoCount = bs->masterFileInfo.ibits;
    if (CoreFile)
        *CoreFile = bs->masterFileInfo.corefilePathName;
    if (StartCore)
        *StartCore = bs->masterFileInfo.startCore;
    if (FngrFile)
        *FngrFile = bs->masterFileInfo.fingerFileName;
    if (Xfiles)
        *Xfiles = bs->masterFileInfo.x_FileName;
    if (nY)
        *nY = bs->actuallSizes;

    if (FngrFile_File)
    {
        if (!((*FngrFile_File) = UTL_FILE_FOPEN((*FngrFile),"r"))) return 0;
        if (!UTL_FILE_FREAD(&i,sizeof(int),1,*FngrFile_File)) return 0; /* nbits fp */
        *BytesPerFingerPrint = ( i + 7 ) / 8 ;
        *WordsPerFingerprint = (*BytesPerFingerPrint + 3) / 4;
        (*query) = (int *) UTL_MEM_ALLOC( *BytesPerFingerPrint);
        if (!UTL_FILE_FREAD(&i,sizeof(int),1,*FngrFile_File)) return 0; /* record cnt */
        if (!UTL_FILE_FREAD(&i,sizeof(int),1,*FngrFile_File)) return 0; /* record size */
        rewind(*FngrFile_File);
        if (!(fooi = (int *) UTL_MEM_ALLOC( i ))) return 0;
        size = (3+i)/4 ;
        for ( i=0; i <= *FingerOff; i++)
            if (!UTL_FILE_FREAD( fooi,sizeof(int),size,*FngrFile_File))
                return 0;
        /* if ( fooi[1] != 2 + nY_01 * nY_02 ) return 0; */
        if ( ScreenFileName )
        {
            if (!((*ScreenFileName) = UTL_STR_SAVE(fooi+4))) return 0 ;
        }
        if ( FingerCore_FP )
        {
            *FingerCore_FP = fooi;
            if (!UTL_FILE_FREAD( FingerCore_Card,sizeof(int), 1 ,
                                 *FngrFile_File))
                return 0;
            if (!UTL_FILE_FREAD(*FingerCore_FP ,
                                sizeof(int),
                                *WordsPerFingerprint,
                                *FngrFile_File))
                return 0;
        }
    }
    return 1;
}
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**+E: 

**
** Function Name : CS_PRDCT_BITSET_INDEXES_TO_INDEX()
**
** Purpose : Function will return the right bit given a set of indices
**
** Usage : all indexes are 0 based.
**
** Returns : index to use in bitset.
**
** Algorithms : None.
**
** Revision History : extracted from CS_PRDCT_BITSET_SET_PRD_BIT by
**                    David Patterson
**
** Author            Date      Description
** Fred Soltanshahi  08/02/96  Original version.
**
*/

int CS_PRDCT_BITSET_INDEXES_TO_INDEX( struct BitSetFileStruct *bitset,
                                      int *indexes)
{
    int i ;
    int j ;
    int rowLength[MAX_VARIATION_SITES] ;
    int indx = 0 ;

    for ( i = 0 ; i < bitset->numVariationSites ; i++ )
    {
        rowLength[i] = 1 ;
        for ( j = i + 1 ; j < bitset->numVariationSites ; j++ )
            rowLength[i] *= bitset->actuallSizes[j] ;
    }
    for ( i = 0 ; i < bitset->numVariationSites ; i++ )
    {
        indx += indexes[i] * rowLength[i] ;
    }
    return indx ;
}
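
The routine above flattens a set of per-site indexes into a single bit address by multiplying each index by the product of the sizes of all later sites (a row-major layout). The sketch below shows the same mapping and its inverse on plain arrays. It uses Horner's form, which is arithmetically equivalent to the stride sum in the listing but is a substituted formulation, and the names `indexes_to_index` and `index_to_indexes` are hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: flattens zero-based per-site indexes into one
 * row-major address. Horner's form is used here; it yields the same
 * value as summing indexes[i] * product(sizes[i+1 .. num_sites-1]). */
static int indexes_to_index(const int *sizes, int num_sites,
                            const int *indexes)
{
    int i;
    int indx = 0;

    assert(num_sites > 0);

    for (i = 0; i < num_sites; i++)
        indx = indx * sizes[i] + indexes[i];
    return indx;
}

/* Hypothetical inverse: recovers the zero-based per-site indexes from a
 * row-major address by repeated modulo and division, last site first. */
static void index_to_indexes(const int *sizes, int num_sites,
                             int indx, int *indexes)
{
    int i;

    for (i = num_sites - 1; i >= 0; i--) {
        indexes[i] = indx % sizes[i];
        indx /= sizes[i];
    }
}
```

With two sites of sizes 300 and 400, indexes (59, 129) map to address 59 * 400 + 129 = 23729. With three sites of sizes 2, 3 and 4, indexes (1, 2, 3) map to 1 * 12 + 2 * 4 + 3 = 23.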

/*
**+E:
**
** Function Name : CS_PRDCT_BITSET_ALLOC_SIZE_INDEXES_TO_INDEX()
**
** Purpose : Function will return the right bit given a set of indices
**           it uses the allocated sizes in the bitset to get the info.
**
*/

int CS_PRDCT_BITSET_ALLOC_SIZE_INDEXES_TO_INDEX( struct BitSetFileStruct
                                                 *bitset,
                                                 int *indexes)
{
    int i ;
    int j ;
    int rowLength[MAX_VARIATION_SITES] ;
    int indx = 0 ;

    for ( i = 0 ; i < bitset->numVariationSites ; i++ )
    {
        rowLength[i] = 1 ;
        for ( j = i + 1 ; j < bitset->numVariationSites ; j++ )
            rowLength[i] *= bitset->allocSizes[j] ;
    }
    for ( i = 0 ; i < bitset->numVariationSites ; i++ )
    {
        indx += indexes[i] * rowLength[i] ;
    }
    return indx ;
}

/*
**+E:
**
** Function Name : CS_PRDCT_BITSET_SET_PRD_BIT()
**
** Purpose : Function will set a product bit with the given indexes.
**
** Usage :
**
** Returns : none.
**
** Algorithms : None.
**
** Revision History :
**
** Author            Date      Description
** Fred Soltanshahi  08/02/96  Original version.
**
*/

int CS_PRDCT_BITSET_SET_PRD_BIT(void *bs, int *indexes)
{
    struct BitSetFileStruct *bitset = (struct BitSetFileStruct *)bs ;
    int indx = 0 ;

    indx = CS_PRDCT_BITSET_ALLOC_SIZE_INDEXES_TO_INDEX(bitset,
                                                       indexes);
    IHBSet(bitset->bitset, indx );
    bitset->totalSelected++ ;
    return 1 ;
}

/*
**+E:
**
** Function Name : CS_PRDCT_BITSET_GET_RINFO()
**
** Purpose : Function will return the Reaction/Reagent info from
**           the bitset file.
**
** Usage :
**
** Returns : none.
**
** Algorithms : None.
**
** Revision History :
**
** Author            Date      Description
** Fred Soltanshahi  01/03/97  Original version.
**
*/

int CS_PRDCT_BITSET_GET_RINFO(void *bs, char **reactionInfo, char ***reagentInfo)
{
    struct BitSetFileStruct *bitset = (struct BitSetFileStruct *)bs ;
    *reactionInfo = bitset->masterFileInfo.prefixForFiles ;
    *reagentInfo = bitset->masterFileInfo.reagentInfo ;
    return 1 ;
}



/*
**+E:
**
** Function Name : CS_PRDCT_BITSET_GET_STATS()
**
** Purpose : Function will return the statistics for a bitset file,
**           these will include numberOfSites, originalSizes,
**           numberOfProducts and Number of fragments used at each
**           variation site.
**
** Usage :
**
** Returns : none.
**
** Algorithms : None.
**
** Revision History :
**
** Author            Date      Description
** Fred Soltanshahi  08/05/96  Original version.
**
*/

int CS_PRDCT_BITSET_GET_STATS(void *bs, int *numSites, int *numProducts,
                              int **sizes, int **numUsed )
{
    struct BitSetFileStruct *bitset = (struct BitSetFileStruct *)bs ;
    *numSites = bitset->numVariationSites ;
    *numProducts = bitset->totalSelected ;
    /*
    ** Allocate buffers, if they have not been.
    */
    if ( !(*sizes) )
    {
        if ( !((*sizes) = (int *)UTL_MEM_CALLOC(*numSites,sizeof(int))))
            goto UnableToAllocateMemory ;
    }
    if ( !(*numUsed) )
    {
        if ( !((*numUsed) = (int *)UTL_MEM_CALLOC(*numSites,sizeof(int))))
            goto UnableToAllocateMemory ;
    }
    memcpy(*sizes, bitset->actuallSizes, sizeof(int) * *numSites );
    memcpy(*numUsed, bitset->numFragsInEachSite, sizeof(int) * *numSites );
    return 1 ;
UnableToAllocateMemory :
    fprintf(stderr,"CS_PRDCT_BITSET_GET_STATS() - Unable to allocate memory\n");
    return 0 ;
}



/*
**+E:
**
** Function Name : CS_PRDCT_BITSET_CORE_INFO()
**
** Purpose : Function will get the xfile and core and xrstring info
**           from the bitset file.
**
** Usage :
**
** Returns : 1 on success or 0 on error.
**
** Algorithms : None.
**
** Revision History :
**
** Author            Date      Description
** Fred Soltanshahi  08/09/96  Original version.
**
**-E:
*/

int CS_PRDCT_BITSET_CORE_INFO(void *bs, char **masterName, int *masterRecno,
        char **core, char **xrString, int *numSites, char ***xFileNames )
{
    return ( ReadBitsetCoreInfo(bs,masterName,masterRecno,
                                core,xrString,numSites,xFileNames));
}

/*
**+E:
**
** Function Name : CS_PRDCT_BITSET_PROG_NAME()
**
** Purpose : Function will get the program name that produced this bitset.
**
** Usage :
**
** Returns : 1 on success or 0 on error
**
** Algorithms : None.
**
** Revision History :
**
** Author            Date      Description
** Fred Soltanshahi  08/09/96  Original version.
**
*/

int CS_PRDCT_BITSET_PROG_NAME(void *bs, char **programName)
{
    *programName = ((struct BitSetFileStruct *)bs)->programInfo.programName ;
    return 1 ;
}

/*
**+E:
**
** Function Name : CS_PRDCT_MSTR_CORE_INFO()
**
** Purpose : Function will get the xfile and core and xrstring info
**           from the master file.
**
** Usage :
**
** Returns : 1 on success or 0 on error.
**
** Algorithms : None.
**
** Revision History :
**
** Author            Date      Description
** Fred Soltanshahi  08/09/96  Original version.
**
**-E:
*/

int CS_PRDCT_MSTR_CORE_INFO(char *masterFile, int index, char **core, char
        **xrString, int *numSites, char ***xFileNames )
{
    return ( ReadMasterCoreInfo(masterFile,index
                                ,core,xrString,numSites,xFileNames));
}



/*
**+E:
**
** Function Name : CS_PRDCT_BITSET_CREATE_BIT_STRING()
**
** Purpose : Function will create a compressed version of a raw bit set.
**           It returns the memory size needed to hold the bitset.
**
** Usage :
**
** Returns : pointer to a compressed bitset (this is not a ChemSpace
**           product bitset but just a compressed bitstring)
**
** Algorithms : None.
**
** Revision History :
**
** Author            Date      Description
** Fred Soltanshahi  08/06/96  Original version.
**
*/

void *CS_PRDCT_BITSET_CREATE_BIT_STRING( int *rawBits, int offset, int
        numVariations, int *sizes, int *allocSizes, int *totalSize)
{
    void *compressed ;

    if ( !(compressed = CreateCompressedBitSet(rawBits,
                                               numVariations,
                                               offset,
                                               sizes,
                                               allocSizes) ) )
        goto UnableToCreateBitSet ;
    *totalSize = IHBRealSize(compressed);
    return compressed ;
UnableToCreateBitSet :
    fprintf(stderr,"CS_PRDCT_BITSET_CREATE_BIT_STRING() - Unable to create bitset\n");
    return ( void *)NULL ;
}

/*
**+E:
**
** Function Name : CS_PRDCT_BITSET_DESTROY_BIT_STRING()
**
** Purpose : Function will destroy the memory for a bitstring
**           allocated by the CREATE call above.
**
** Usage :
**
** Returns : none
**
** Algorithms : None.
**
** Revision History :
**
** Author            Date      Description
** Fred Soltanshahi  08/06/96  Original version.
**
**-E:
*/

void CS_PRDCT_BITSET_DESTROY_BIT_STRING( void *bitset)
{
    IHBDestroy(bitset);
}



/*
**+E:
**
** Function Name : CS_PRDCT_BITSET_GETHITS()
**
** Purpose : Function will return the indexes (into the original X1, X2 files)
**           for the requested number of hits.
**
** Usage :
**
** Returns : Number of hits found or -1 for error
**
** Algorithms : None.
**
** Revision History :
**
** Author            Date      Description
** Fred Soltanshahi  08/07/96  Original version.
**
**-E:
*/

int CS_PRDCT_BITSET_GETHITS( void *bs, int offset, int numberOfHits, int
        ***hitIndexes)
{
    struct BitSetFileStruct *bitset = (struct BitSetFileStruct *)bs ;
    int numFound ;
    int numConnections ;
    static int *bitAddresses = (int *)NULL ;
    static int numBitAddresses = 0 ;
    int *indxs = (int *)NULL ;
    int start ;
    int count ;
    int i ;
    int j ;

    (*hitIndexes) = (int **)NULL ;
    numConnections = bitset->numVariationSites ;
    /*
    ** Local housekeeping.
    */
    if ( numberOfHits > numBitAddresses )
    {
        if ( !bitAddresses )
        {
            if ( !(bitAddresses = (int *)UTL_MEM_CALLOC(numberOfHits,
                                                        sizeof(int))) )
                goto UnableToAllocate ;
        }
        else
        {
            if ( !(bitAddresses = (int *)UTL_MEM_REALLOC(bitAddresses,
                                                         numberOfHits * sizeof(int))) )
                goto UnableToAllocate ;
        }
        numBitAddresses = numberOfHits ;
    }
    /*
    ** Figure out if we have the number of hits he wanted and what their addresses
    ** are in the bitset file.
    **
    ** We will have to come back and speed this up if it is too slow, but for now
    ** lets get it working.
    */
    /* start = bitset->firstHitAddress ; */
    start = -1 ; /* start from the beginning */
    for ( count = 0 ; count < offset ; count++ )
    {
        start = IHBFindNextOne(bitset->bitset,start+1);
        /*
        ** Lets remember where the first hit is, this should save us some time later.
        */
        if ( bitset->firstHitAddress <= 0 )
            bitset->firstHitAddress = start ;
        if ( start == -1 )
        {
            return 0 ;
        }
    }
    /*
    ** Now lets see how many bits are set from here on.
    */
    for ( numFound = 0 ; numFound < numberOfHits ; numFound++ )
    {
        start = IHBFindNextOne(bitset->bitset,start+1);
        if ( start == -1 )
            break;
        bitAddresses[numFound] = start ;
    }
    /*
    ** Allocate the arrays.
    */
    if ( !(*hitIndexes = (int **)UTL_MEM_CALLOC(numConnections,sizeof(int *))) )
        goto UnableToAllocate ;
    for ( i = 0 ; i < numConnections ; i++ )
    {
        if ( !((*hitIndexes)[i] = (int *)UTL_MEM_CALLOC(numFound,
                                                        sizeof(int ))) )
            goto UnableToAllocate ;
    }
    /*
    ** Now translate each one of the bitset addresses to the variation site
    ** indexes.
    */
    for ( i = 0 ; i < numFound ; i++ )
    {
        BitSetAddressToIndexes(bitset,bitAddresses[i],&indxs,0);
        for ( j = 0 ; j < numConnections ; j++ )
            (*hitIndexes)[j][i] = indxs[j] + 1 ; /* Translate to 1 based indexes */
    }
    if ( indxs )
        UTL_MEM_FREE( indxs );
    return numFound ;
UnableToAllocate :
AddTraceback :
    if ( indxs )
        UTL_MEM_FREE( indxs );
    return -1 ;
}

/*
**+E:
**
** Function Name : CS_PRDCT_BITSET_GET_PARTIAL_HITS()
**
** Purpose : Function will return the indexes (into the original X1, X2 files)
**           for the requested number of hits.
**
** Usage :
**
** Returns : Number of hits found or -1 for error.
**
** Algorithms : None.
**
** Revision History :
**
** Author            Date      Description
** Fred Soltanshahi  08/07/96  Original version.
**
**-E:
*/

int CS_PRDCT_BITSET_GET_PARTIAL_HITS( void *bs, int *numProducts, int site, int
        numFixedSites, int *fixedSitesIndexes, int *numFragmentsPerSite, int **hitIndexes )
{
    struct BitSetFileStruct *bitset = (struct BitSetFileStruct *)bs ;
    int total ;

    (*hitIndexes) = (int *)NULL ;
    GetPartialProductsStats( bitset ,
                             numFixedSites,
                             fixedSitesIndexes,
                             &total,
                             numFragmentsPerSite);
    (*numProducts ) = GetPartialProductsAddresses(bitset,
                                                  numFixedSites,
                                                  fixedSitesIndexes,
                                                  site,
                                                  hitIndexes);
    return 1 ;
}

/*
**+E:
**
** Function Name : CS_PRDCT_BITSET_GET_PRDCT_PARTIAL_HITS()
**
** Purpose : Function will return the indexes (into the original X1, X2 files)
**           for the requested number of hits.
**
** Usage : This works when the csln is actually being exploded.
**
** Returns : Number of hits found or -1 for error.
**
** Algorithms : None.
**
** Revision History :
**
** Author            Date      Description
** Fred Soltanshahi  08/07/96  Original version.
**
**-E:
*/

int CS_PRDCT_BITSET_GET_PRDCT_PARTIAL_HITS( void *bs, int *numProducts, int
        site, int numFixedSites, int *fixedSitesIndexes, int *numFragmentsPerSite, int **hitIndexes
        )
{
    struct BitSetFileStruct *bitset = (struct BitSetFileStruct *)bs ;
    int total ;

    (*hitIndexes) = (int *)NULL ;
    GetPartialProductsStats( bitset ,
                             numFixedSites,
                             fixedSitesIndexes,
                             &total,
                             numFragmentsPerSite);
    (*numProducts ) = GetPartialProductsAddresses(bitset,
                                                  numFixedSites,
                                                  fixedSitesIndexes,
                                                  site,
                                                  hitIndexes);
    (*numProducts ) = total ;
    return 1 ;
}

/*+E
Abstract: For Chemspace bitset file call callback with products choices not selected.
Input:
    1. This function takes a BitSetFileStruct returned most likely from:
       CS_PRDCT_BITSET_OPEN(char *filename)
    2. A void pointer which is passed to callback function. This is for
       whatever you want.
    3. A pointer to function returning:
       int (void *udata, int numVariants, int *choices ),
       choices is of size numVariants, the choices are zero based, and
       choices[0] is the choice for markush Y_01, choice[1] for Y_02 etc.
       NOTE 1: numVariants of -1 and a null for choices is passed to signify
               the end of the choices excluded, just in case the function
               wants to do some special processing at the end.
       NOTE 2: The return value from the callback function is ignored.
Returns:
    Total number of bits excluded.
    -1 upon error.

** Author     Date      Description
** Rob Jilek  07/26/96  Original version.
*/

int CS_PRDCT_BITSET_ZERO(struct BitSetFileStruct *bitset, void *udata,
        int (*ZeroProducts)(void *udata, int numVariants, int *choices ) )
{
    BIT_TRACKING bt[1];

    if ( bitset->numVariationSites < 0 )
        return -1;
    bt->numVariations = bitset->numVariationSites;
    bt->bitset = bitset;
    bt->call_udata = udata;
    bt->funcptr = ZeroProducts;
    bt->choices = (int *) UTL_MEM_CALLOC(bt->numVariations, sizeof(int) );
    bt->totalExcluded = 0;
    /* The sequence is as follows:
       IHBRange has a loop to find zeros/ones.
       It calls RangeCallback
       RangeCallback calls ZeroProducts callback
       while ( not end of list )
           call RangeCallback with start and end Range.
           for ( i = startRange; i <= EndRange; i++ )    // RangeCallback
               calculate product array.
               call ZeroProducts                          // ZeroCallback
    */
    IHBRange(bitset->bitset, 0, (void *) bt, RangeCallback );
    UTL_MEM_FREE((char *) bt->choices );
    return bt->totalExcluded;
}

/*
Synopsis: Gets called for each range of bits set. It then
converts each bit to a product array and calls callback for each.
*/

static int RangeCallback ( void *udata, int startRange, int endRange )
{
    BIT_TRACKING *bt = (BIT_TRACKING *) udata;
    int indx;
    int oor;
    int skip;
    void *call_udata;
    int numVar;
    int *choices;
    void *bitset;

    call_udata = bt->call_udata;
    numVar = bt->numVariations;
    choices = bt->choices;
    bitset = bt->bitset;
    for ( indx = startRange; indx <= endRange; )
    {
        skip = BitSetAddressToIndexes(bitset,indx,&choices,&oor);
        if ( !oor )
        {
            (*bt->funcptr)(call_udata, numVar, choices );
            bt->totalExcluded++ ;
            indx++;
        }
        else
        {
            if ( skip > 0 )
                indx += skip;
            else
                indx++;
        }
    }
    (*bt->funcptr)(call_udata,-1, (int *) 0 ); /* Signify end of zeros. */
    return 0;
}
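
The callback protocol described above (one call per matching product with a zero-based `choices` array, followed by a call with `numVariants` of -1 and a null `choices` pointer to mark the end) can be sketched on a plain bitset as follows. All names here (`for_each_bit`, `tally_cb`, `struct tally`) are hypothetical, the 8-site limit is arbitrary, and the sketch sends the end marker once per scan, whereas the listing sends it at the end of each range.

```c
#include <assert.h>

typedef int (*choice_callback)(void *udata, int num_variants, int *choices);

/* Hypothetical driver: calls `cb` once for every product whose bit equals
 * `want` (0 or 1), then once more with num_variants == -1 and a null
 * choices pointer. Returns the number of products visited.
 * Assumptions: row-major layout, bit i in byte i / 8 under mask 1 << (i % 8). */
static int for_each_bit(const unsigned char *bits, const int *sizes,
                        int num_sites, int want,
                        choice_callback cb, void *udata)
{
    int choices[8];
    int total = 1;
    int visited = 0;
    int addr, rem, i;

    assert(num_sites > 0 && num_sites <= 8);

    for (i = 0; i < num_sites; i++)
        total *= sizes[i];

    for (addr = 0; addr < total; addr++) {
        int bit = (bits[addr / 8] >> (addr % 8)) & 1;
        if (bit != want)
            continue;
        /* Convert the flat address into zero-based per-site choices */
        rem = addr;
        for (i = num_sites - 1; i >= 0; i--) {
            choices[i] = rem % sizes[i];
            rem /= sizes[i];
        }
        cb(udata, num_sites, choices);  /* return value ignored, as in the listing */
        visited++;
    }
    cb(udata, -1, (int *) 0);           /* signal the end of the products */
    return visited;
}

/* Example callback state: counts calls and remembers the last choices. */
struct tally { int calls; int ended; int last[2]; };

static int tally_cb(void *udata, int num_variants, int *choices)
{
    struct tally *t = (struct tally *) udata;

    if (num_variants == -1) {
        t->ended++;
        return 0;
    }
    t->calls++;
    t->last[0] = choices[0];
    t->last[1] = choices[1];
    return 0;
}
```

Passing the user data through an opaque pointer keeps the driver independent of what the caller accumulates, which is the same design the listing uses with `udata`.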

/*+E
Abstract: For Chemspace bitset file call callback with products choices selected.
Input:
    1. This function takes a BitSetFileStruct returned most likely from:
       CS_PRDCT_BITSET_OPEN(char *filename)
    2. A void pointer which is passed to callback function. This is for
       whatever you want.
    3. A pointer to function returning:
       int (void *udata, int numVariants, int *choices ),
       choices is of size numVariants, the choices are zero based, and
       choices[0] is the choice for markush Y_01, choice[1] for Y_02 etc.
       NOTE 1: numVariants of -1 and a null for choices is passed to signify
               the end of the choices excluded, just in case the function
               wants to do some special processing at the end.
       NOTE 2: The return value from the callback function is ignored.
Returns:
    Total number of bits included.
    -1 upon error.

See Also: CS_PRDCT_BITSET_ZERO

** Author     Date      Description
** Rob Jilek  08/19/96  Original version.
*/

int CS_PRDCT_BITSET_ONE(struct BitSetFileStruct *bitset, void *udata,
        int (*OneProducts)(void *udata, int numVariants, int *choices ) )
{
    BIT_TRACKING bt[1];

    if ( bitset->numVariationSites <= 0 )
        return -1;
    bt->numVariations = bitset->numVariationSites;
    bt->bitset = bitset;
    bt->call_udata = udata;
    bt->funcptr = OneProducts;
    bt->choices = (int *) UTL_MEM_CALLOC(bt->numVariations, sizeof(int) );
    bt->totalExcluded = 0;
    IHBRange(bitset->bitset, 1, (void *) bt, RangeCallback );
    UTL_MEM_FREE((char *) bt->choices );
    return bt->totalExcluded;
}
#if 0

main(argc,argv)
int argc ;
char *argv[] ;
{
    void *h ;
    char *masterFileName =

"/home7/ftcd/work/ADS/dserv/source/dbcsln_des/TestData/W^ ; 
int masterRecNumber = 1 ; 
    int *bitset ;
    int size = (300*400 + 7)/8;
    int i ;
    int j ;
    int indexes[2] ;
    char hold[81];
#if 1
    if ( !(h = CS_PRDCT_BITSET_OPEN(argv[1],0)))
    {
        fprintf(stderr,"Unable to open the bitset file %s\n",argv[1]);
        exit ;
    }
    CS_PRDCT_BITSET_DUMP(h);
    CS_PRDCT_BITSET_CLOSE(h);
#else
    if ( !(h =
           CS_PRDCT_BITSET_CREATE(masterFileName,masterRecNumber,NULL) ) )
    {
        fprintf(stderr,"Unable to create bitset for %s\n",masterFileName);
        exit ;
    }
    CS_PRDCT_BITSET_WRITE("Test.bs","MyProg",h,0,NULL);
    indexes[0] = 59 ;
    indexes[1] = 129 ;
    CS_PRDCT_BITSET_SET_PRD_BIT(h,indexes);
    indexes[0] = 159 ;
    indexes[1] = 241 ;
    CS_PRDCT_BITSET_SET_PRD_BIT(h,indexes);
    CS_PRDCT_BITSET_WRITE("Test2.bs","MyProg",h,0,NULL);
    bitset = (int *)UTL_MEM_CALLOC(size,sizeof(int));
    bitset[0] = 49 ;
    bitset[1] = 99 ;
    CS_PRDCT_BITSET_SETBITS(h,bitset,-1);
    CS_PRDCT_BITSET_WRITE("Test1.bs","MyProg",h,0,NULL);
    CS_PRDCT_BITSET_CLOSE(h);
#endif
}

#endif
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•/ 

/* topsim */

/*
 *
 * This program determines which csln "products" are similar to an input
 * structure, where similarity occurs if the sum of differences in encoded
 * "CoMFA" fields is less than some threshold.
 * The csln components are referenced in a master file with
 * one multiline record per cSLN. Record format is
 *     Reaction class xxxx    (where "Reaction class" is a literal)
 *     reaction_name
 *     number_of_sv_sites
 *     missing_bits_count
 *     hashed_only_missing_bits_count
 *     core_filename
 *     core_filename_index_of_core
 *     fingerprint_filename
 *     offset_into_fingerprint_file
 *     first_sv_file_X1
 *     second_sv_file_X2      (etc if more than two sv sites)
 * NOTE - ALL subsequent entries in the master file whose Reaction class
 * matches the Reaction class of the record referenced by -index are also
 * processed! ("Matching" implies matching of possible other input symbols
 * to components of the Reaction class line.)
 * The input structure is read as encoded fields from stdin (or
 * a named file if provided), one field per line. There
 * must be provided (by a SYBYL SPL script), in order:
 *   "number_of_sv_sites" * "number_of_field_types" fields describing the "core" of the
 *      query
 *   "number_of_sv_sites" - 1 sextets of relative coordinates of core attachment atoms
 *   "number_of_sv_sites" * "number_of_field_types" fields describing the "side chains"
 *
 * Options:
 *   -master name      - name is the file with master file records
 *   -bitset name      - name is a result of an earlier search operation
 *                       (use EITHER master or bitset)
 *   -index number     - which sequential record in master file to begin at
 *                       OR offset into bitset in a bitset file
 *                       (default = 1)
 *   -reaction name    - records in master file to be processed must have this
 *                       class name
 *   -details name     - if provided, records in master file to be processed
 *                       must have any one of these tokens following its class name
 *   -distance tan     - tan is the overall similarity threshold
 *                       (default is 90.0)
 *   -cooweight cwt    - weight of the core attachment coordinates,
 *                       relative to fields
 *   -nocore nocore    - do not consider core topomer differences
 *                       By default these are considered (required)
 *   -allcores allc    - process all cores in the core file
 *                       By default only one core (index in the master file) is processed
 *   -maxhits max      - stop when max hits are found (default infinity)
 *   -input filename   - name of file with queries (default stdin)
 *   -output filename  - specifies the output file for the hit info
 *
 * This flag forces the display of all
 * options
 *
 */

#include <stdio.h>
#include <signal.h>
#include <ctype.h>
#include <unistd.h>
#include <string.h>
#include <sys/stat.h>
#include <math.h>
#include "parseopt.h"
#include "utl_str.h"
#include "utl_mem.h"
#include "utl_file.h"
#include "utl_math.h"
#include "ct.h"
#include "ct_expr.h"
#include "ct_proto.h"
#define GoodExit 0
#define ErrorExit 1
#define Visual(s) { fprintf s; }

static FILE   *OutputFile = 0;
static char   *OutputFileBase;
static char   OutputFileName[200];
static int    nOutFiles = 0;      /* number of output files */
static char   *MasterFile = 0;
static char   *BitsetFile = 0;
static void   *bitset;
static int    MasterRecord = 1;
static FILE   *MasterFile_File;
static int    StartCore;
static char   *InputSource = 0;
static FILE   *InputSourceFile;
static char   *ReactionNeeded = 0;
static char   *ScratchDetails = 0;
static int    nDetail = 0;
static char   **ReactionDetails = 0;
static char   *XWeights = 0;
static double *RWeights = 0;
static double CoreWeight;
static char   *FieldTypes = 0;
static int    nFType = 0;
static char   **FTypes = 0;
static double *FWeights = 0;
static char   **FOrder = 0;       /* temp, for recording L->R order of data
                                     in side chain SLN */
static char   **FROrder = 0;
static char   *Corefile = 0;
static FILE   *CoreFile_File;
static char   *CoreNow;
static char   **Xfile;
static char   **Xname;
static double Distance = 90.0;
static double CoreDistance = 0.0;
static double DWeight = 1.0;
static double Dist[16][16];
static double boundary[16];
static double CXcoords[6], CXdiffsq[6];
static double searched = 0.0;     /* number searched */
static double combi = 1.0;        /* number of side chain combos */
static int    totnout = 0, nout = 0; /* number of products */
static int    *Good_Products = 0; /* product bit set */
static int    *Dead_Products = 0; /* forbidden product bit set */
static int    nR;                 /* number of R positions (usually 2) */
static int    *nX;                /* number of product dimensions */
static int    *Xct;               /* used for indexing over all products */
static int    **Xsize;            /* bytes per field */
static unsigned char ****X = 0;   /* csln field (F x R x nX ) */
static unsigned char ***Xin;      /* target fields */
static unsigned char ***Y;        /* csln core fields (F x R ) */
static unsigned char ***Yin;      /* target core fields */
static double ***X2Y;             /* distances between X and X' */
static int    nSym,               /* number of symmetries in this core */
              *CoreSyms,          /* flags for all matching core symmetries */
              **SymList;          /* symmetry mappings */
static int    DefaultSym[9] = {0, 1, 2, 3, 4, 5, 6, 7, 8};
static int    ReverseSym[2] = {1, 0};
static int    AppendToOutputFile = 0;
static int    NoMorehitsPlease = 0;
static int    UserAborted;
static int    NoCore = 0;
static int    AllCores = 0;
static int    CoreOK = 0;
static int    CoreIsSame = 0;
static int    SideChainOnly = 0;
static int    SideChainsAreSame = 0;
static int    NotBitOutput = 0;
static char   comline[2048];
static struct ParseOptions Options[] = {
    {"master", ParseOptString, &MasterFile,
        "Prefix for all input files" },
    {"bitset", ParseOptString, &BitsetFile,
        "Name is the file with bitset records" },
    {"distance", ParseOptDouble, &Distance,
        "Field similarity threshold (default 90.0)" },
    {"cooweight", ParseOptDouble, &DWeight,
        "Core coord wt, relative to fields (default )" },
    {"index", ParseOptInt, &MasterRecord,
        "Which MasterRecord entry 1-n" },
    {"maxhits", ParseOptInt, &NoMorehitsPlease,
        "Maximum number of hits before stopping" },
    {"nocore", ParseOptInt, &NoCore,
        "Use -nocore to override inclusion of the core differences" },
    {"allcores", ParseOptInt, &AllCores,
        "Use -allcores to search all cores provided" },
    {"input", ParseOptString, &InputSource,
        "File from which queries will be read( default stdin). "},
    {"output", ParseOptString, &OutputFileBase,
        "File to which hit info will be written. "},
    {"notbits", ParseOptInt, &NotBitOutput,
        "Use notbits to output as index ASCII instead of std bitset." },
    {"reaction", ParseOptString, &ReactionNeeded,
        "Reaction class for topomer search. "},
    {"details", ParseOptString, &ScratchDetails,
        "Details further discriminating the reaction class. "},
    {"sidechain", ParseOptInt, &SideChainOnly,
        "Use sidechain to search for similarity in a single sidechain only. "},
    {"fieldtypes", ParseOptString, &FieldTypes,
        "Names of all field types (optional prefix =weight), space separated. Does CTOPS if none provided."},
    {"xweights", ParseOptString, &XWeights,
        "Weights of varying sites. Must be nR(+core?) individual weights present (if any)."},
};

int UBS_OUTPUT_MESSAGE() { return 0; } /* just for compiling OK */
int UIMS2_WRITE_PHOTO() { return 0; }

int lowercase (s) char *s; {while (*s) { if (isupper(*s)) *s = tolower(*s); s++;}}
static int ParseArguments( argc, argv )
/*
 * This function parses the command line arguments.
 *
 * Returns: 1 on a successful command line parse, 0 otherwise.
 * Warnings:
 * Errors:
 * Author        Date      Description
 * G. B. Smith   02-09-93  Original Version
 */
int argc;
char **argv;
{
    int nargs,
        noptions = sizeof( Options )/sizeof(Options[0]);
    nargs = UTL_PARSE_OPT( argc, argv, noptions, Options );
    if( !nargs ) goto SyntaxError;
    return 1;
SyntaxError:
    fprintf( stderr, "Bad command line argument(s)\n" );
    return 0;
}

static int OpenOutputFile()
/*
 * Returns: 1 on success, else 0
 */
{
    char *msg;
    FILE *fp;
    OutputFile = stdout;
    if( OutputFileBase) {
        MakeOutputFileName();
/*
** We need to create output files under the ownership of the REAL user not the
** EFFECTIVE user. This only applies if setuid options are activated.
*/
        {
            struct stat statBuff ;
            int uid ;
            int euid ;
            uid = getuid() ;
            euid = geteuid();
            stat(OutputFileName, &statBuff);
/*
** There are two cases
** (1) the file to output to exists
**     Use the ownership of the current owner of the file or if you cant do that
**     do not do anything.
** (2) The file is being created.
**     use the ownership of the REAL user.
*/
            if ( access(OutputFileName, F_OK) == 0 )
            { /* If the file exist and the real user is the owner of the file */
                if ( statBuff.st_uid == uid )
                    setuid(uid);
            }
            else
            { /* Create the file as the REAL user */
                setuid(uid);
            }
        }
        OutputFile = fopen( OutputFileName, (AppendToOutputFile?"a":"wb"));
        if( !OutputFile ) {
            fprintf(stderr,"Error: Failed to open output file \"%s\"\n",
                OutputFileName );
            goto ErrorReturn;
        }
    }
    return 1;
ErrorReturn:
    return 0;
}

static int WhatsTheDifference()
/* builds distance lookup table and initializes default symmetry data structure */
{
    int i, j;
#define pow2(a) ( (a) * (a) )
    /* the assignment of codes is based on the following (from gen_pls.c):
       static fptcutoff[16] = {9999., 0., 2., 4., 6., 8., 10., 12.,
                               14., 16., 18., 20., 22., 24., 26., 30. };
    */
    boundary[0] = 9999.; /* missing data ought never to occur. */
    boundary[1] = -0.1 ;
    for (i=2;i<15;i++)
        boundary[i] = 2*i-3;
    boundary[15] = 30.0; /* this is a step curve with a cutoff at 30! */
    for (i=0;i<16;i++) for (j=0;j<16;j++)
        Dist[i][j] = pow2( boundary[i] - boundary[j]);
    Distance *= Distance; /* want to test D^2 directly */
    DWeight *= DWeight;
    /* allocate once for all conceivable symmetry reorderings */
    if (!(SymList = (int **) UTL_MEM_ALLOC( sizeof( int *) * nR * (nR - 1) / 2) ))
        return 0;
    if (!(CoreSyms = (int *) UTL_MEM_ALLOC( sizeof( int ) * nR * (nR - 1) / 2) ))
        return 0;
    SymList[ 0 ] = DefaultSym;
    SymList[ 1 ] = ReverseSym;
    return 1;
}
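
/* Illustrative sketch only, not part of the original listing. It rebuilds
   the 16 x 16 table of squared boundary differences set up above and shows
   how two fields, packed as two 4-bit codes per byte, can be compared with
   it. The names dist_table, build_dist_table and packed_field_dist are
   hypothetical; the boundary values are the ones given in the listing. */

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the Dist[16][16] table of the listing. */
static double dist_table[16][16];

/* Fill the table with squared differences between the 16 bin boundaries. */
void build_dist_table(void)
{
    double boundary[16];
    int i, j;

    boundary[0] = 9999.0;            /* code 0 marks missing data */
    boundary[1] = -0.1;
    for (i = 2; i < 15; i++)
        boundary[i] = 2 * i - 3;     /* 1, 3, 5, ..., 25 */
    boundary[15] = 30.0;             /* cutoff value */

    for (i = 0; i < 16; i++)
        for (j = 0; j < 16; j++)
            dist_table[i][j] = (boundary[i] - boundary[j])
                             * (boundary[i] - boundary[j]);
}

/* Sum the table entries over every 4-bit code pair of two packed fields.
   Each byte carries one code in its low nibble and one in its high nibble. */
double packed_field_dist(const unsigned char *a, const unsigned char *b,
                         size_t nbytes)
{
    double sum = 0.0;
    size_t k;

    assert(a != NULL && b != NULL);
    for (k = 0; k < nbytes; k++) {
        sum += dist_table[a[k] & 0x0F][b[k] & 0x0F]
             + dist_table[(a[k] & 0xF0) >> 4][(b[k] & 0xF0) >> 4];
    }
    return sum;
}
```

/* Because the user-supplied threshold is squared once up front, sums such
   as this one can be compared against it without taking a square root. */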

static int ReadAField( hex, index, pXP )
/* converts field from external (ASCII hex) format to internal */
char *hex;
int *index;
unsigned char **pXP;
{
    int words, hold;
    char next2[10], *nxhx;
    words = strlen( hex ) / 2; /* assuming 8-bit bytes */
    if (! *index ) *index = words;
    if ( words != *index ) {
        /* bad field (most likely NULL), continue anyway */
        *pXP = (unsigned char *) NULL;
        return 1;
    }
    if (!(*pXP = (unsigned char *) UTL_MEM_ALLOC(words) )) return 0;
    for (words=0, nxhx = hex; words < *index ; words++) {
        memcpy(next2, nxhx, 2);
        nxhx += 2;
        sscanf( next2, "%2x", &hold );
        *(*pXP + words) = (unsigned char) hold;
    }
    return 1;
}
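
/* Illustrative sketch only, not part of the original listing. It shows the
   same ASCII-hex to byte conversion using only the standard C library.
   The name hex_to_bytes is hypothetical. Unlike the routine above, this
   sketch rejects odd-length or non-hex input, and it NUL-terminates each
   two-character pair explicitly before converting it. */

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Decode an ASCII hex string into a newly allocated byte array.
   On success returns the array (caller frees it) and stores its length
   in *nbytes; returns NULL on empty, odd-length or non-hex input. */
unsigned char *hex_to_bytes(const char *hex, size_t *nbytes)
{
    size_t len, k;
    unsigned char *out;

    assert(hex != NULL && nbytes != NULL);
    len = strlen(hex);
    if (len == 0 || len % 2 != 0)
        return NULL;
    out = malloc(len / 2);
    if (out == NULL)
        return NULL;
    for (k = 0; k < len / 2; k++) {
        char pair[3];
        if (!isxdigit((unsigned char) hex[2 * k]) ||
            !isxdigit((unsigned char) hex[2 * k + 1])) {
            free(out);
            return NULL;
        }
        pair[0] = hex[2 * k];
        pair[1] = hex[2 * k + 1];
        pair[2] = '\0';
        out[k] = (unsigned char) strtoul(pair, NULL, 16);
    }
    *nbytes = len / 2;
    return out;
}
```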

static int RetrieveInput() {
    /* reads the search pattern fields (generated by SYBYL script) */
    int index, R, F;
    char *line;
    double atof();
    if (!InputSource) InputSourceFile = stdin;
    else if (!(InputSourceFile = fopen( InputSource, "r" ) )) {
        fprintf( stdout, "Could not open -input file %s\n", InputSource );
        return 0;
    }
    if (!(Yin = (unsigned char ***) UTL_MEM_ALLOC( sizeof( unsigned char **) * nFType
        )))
        return 0;
    for (F = 0; F < nFType; F++) {
        if (!(Yin[ F ] = (unsigned char **) UTL_MEM_ALLOC( sizeof( unsigned char *) * nR
            )))
            return 0;
        memset( Yin[F], 0, sizeof( unsigned char *) * nR );
    }
    if (!NoCore) {
        /* field types are paired closest! */
        for (index = 0; index < nR; index ++) for (F = 0; F < nFType; F++) {
            /* a Field is on a single line, no parsing needed */
            if (-1 == UTL_SCAN_GETS( InputSourceFile, "\\", &line))
                return 0;
            if (!ReadAField( line, Xsize[ F ] + index, Yin[ F ] + index )) return 0;
        }
        for (index = 0; index < 6; index++) {
            if (-1 == UTL_SCAN_GETS( InputSourceFile, "\\", &line))
                return 0;
            CXcoords[ index ] = atof( line );
        }
        if (!(Xin = (unsigned char ***) UTL_MEM_ALLOC( sizeof( unsigned char **) * nFType
            )))
            return 0;
        for (F = 0; F < nFType; F++) {
            if (!(Xin[F] = (unsigned char **) UTL_MEM_ALLOC( sizeof( unsigned char *) * nR
                )))
                return 0;
            memset( Xin[F], 0, sizeof( unsigned char *) * nR );
        }
        for (index = 0; index < nR; index++) for (F = 0; F < nFType; F++ ) {
            /* a Field is on a single line, no parsing needed */
            if (-1 == UTL_SCAN_GETS( InputSourceFile, "\\", &line))
                return 0;
            if (!ReadAField( line, Xsize[ F ] + index, Xin[ F ] + index )) return 0;
        }
    }
    fclose( InputSourceFile );
    return 1;
}

static int InitCore() {
    /* readies core file and its input arrays */
    int R, i, F;
    char *foo;

    if (! (CoreFile_File = fopen(Corefile,"r"))) {
        fprintf( stderr, "%s Core file not found\n", Corefile );
        return 0;
    }
    i=0;
    while ( i < StartCore )
    {
        if ( -1 == UTL_SCAN_GETS( CoreFile_File, "\\", &foo)) return 0;
        if (AllCores) break;
        i++;
    }
    CoreNow = UTL_STR_SAVE( foo );
    /* initialize core data structures */
    if (!(Y = (unsigned char ***) UTL_MEM_ALLOC( sizeof( unsigned char **) *
        nFType)) )
        return 0;
    for (F = 0; F < nFType; F++) {
        if (!(Y[F] = (unsigned char **) UTL_MEM_ALLOC( sizeof( unsigned char *) * nR)) )
            return 0;
        for(R = 0; R < nR; R++)
            if (!( *( (Y[F]) + R ) = (unsigned char *) UTL_MEM_ALLOC( sizeof( unsigned
                char)
                * *( (Xsize[F]) + R ) )) return 0;
    }
    return 1;
}

int CountLines()
{
    int i;
    char *foo;
    /* note that CountLines returns one less than the actual number */
    i=0;
    while ( -1 != UTL_SCAN_GETS( InputSourceFile, "\\", "", &foo)) i++;
    rewind(InputSourceFile);
    return i;
}

static int initXarrays()
{
    int F, i;
    if (!(Xfile = (char **) UTL_MEM_ALLOC( sizeof( char* ) * nR ))) return 0;
    if (!(Xname = (char **) UTL_MEM_ALLOC( sizeof( char* ) * nR ))) return 0;
    if (!(nX = (int*) UTL_MEM_ALLOC( sizeof( int ) * nR ))) return 0;
    if (!(Xct = (int*) UTL_MEM_ALLOC( sizeof( int ) * nR ))) return 0;
    for (i = 0; i < nR; i++) { Xfile[i] = 0; Xname[i] = 0; nX[i] = 0; Xct[i] = 0; }
    if (!(X = (unsigned char ****) UTL_MEM_ALLOC( sizeof( unsigned char ***) *
        nFType)))
        return 0;
    for (F = 0; F < nFType; F++) {
        if (!(X[F] = (unsigned char ***) UTL_MEM_ALLOC( sizeof( unsigned char **)
            * nR)))
            return 0;
        memset( X[F], 0, sizeof( unsigned char **) * nR );
    }
    if (!(Xsize = (int **) UTL_MEM_ALLOC( sizeof( int * ) * nFType ))) return 0;
    for (F = 0; F < nFType; F++) {
        if (!(Xsize[F] = (int *) UTL_MEM_ALLOC( sizeof( int ) * nR ))) return 0;
        for (i = 0; i < nR; i++) *(Xsize[F] + i) = 0;
    }
    return 1;
}

static int initXfiles( i, SideChainsAreSame )
/* reads X file data (reactant descriptors from 2nd comment line of X file ) */
int i, *SideChainsAreSame;
{
    char *foo, *pch;
    if ( -1 == UTL_SCAN_GETS( MasterFile_File, "\\", &foo)) return 0;
    if (Xfile[i]) {
        /* if this X file is same as last, nothing to do */
        if (!strcmp( Xfile[ i ], foo ) ) return 1;
        *SideChainsAreSame = FALSE;
        UTL_MEM_FREE( Xfile[i] );
    }
    Xfile[ i ] = UTL_STR_SAVE(foo);
    if (! (InputSourceFile = fopen(Xfile[i],"r"))) {
        fprintf( stdout, "Could not open variation file %s\n", Xfile[i] );
        return 0;
    }
    /* reading COMMENT lines to get USER_NAME value for matching */
    if ( -1 == UTL_SCAN_GETS( InputSourceFile, "\\", "", &foo)) return 0;
    if ( -1 == UTL_SCAN_GETS( InputSourceFile, "\\", &foo)) return 0;
    if (Xname[i]) UTL_MEM_FREE( Xname[i] );
    Xname[i] = 0;
    pch = strstr( foo, "USER_NAME=" );
    pch += strlen( "USER_NAME=" );
    if (!(Xname[i] = UTL_STR_SAVE( pch ) )) return 0;
    fclose( InputSourceFile );
    return 1;
}

int StartFromBitsetO 
{ 

v<Md *CS_PRDCT_BITSET_OPEN0; 

if ( !( bitsct = CS_PRDCT_BrrSET_OPEN( BitsetFilc, MasterRecord))) return 0; 

25 

if ( !RetrieveMasterFileFromBitset(bitset. 

&MasterFile, 

&MasterRecord, /*in master file*/ 
0. 

30 0. 

0, 
0, 
0, 
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0. 
0. 

0, 
0, 

5 0. 

0, 
0, 
0. 
0. 

10 0. 

0. 

0 ) ) return 0; 

return 1; 

} 

/* 1/7/97 DEP: allow reading of bitsets. Since the masterfile must be
   read in any case, the bitset only generates "Dead_Products" */
int InitMasterFile()
/* Read the master file record which is requested;
   failure if it does not match the input line info */
{
    int i, d, size, rxMatch, irx, ns, *Sym;
    char *foo;
    int *fooi;
    if (BitsetFile && ! StartFromBitset()) return 0;
    if (! (MasterFile_File = fopen(MasterFile,"r"))) {
        fprintf( stdout, "%s (master file) not found\n", MasterFile );
        return 0;
    }
    rxMatch = irx = 0;
    while ( !rxMatch) {
        if ( -1 == UTL_SCAN_GETS( MasterFile_File, "\\", &foo)) return 0;
        if ( strstr(foo, "Reaction class ")) {
            irx++;
            if (bitset && irx > MasterRecord) return 0; /* the right record did not match */
            /* preliminary match if (1) Reaction Needed matches and (2)
               NO_core must be present if NoCore is TRUE (or vice versa) */
            rxMatch = ( irx >= MasterRecord && strstr( foo, ReactionNeeded )
                && ((!NoCore && !strstr( foo, "NO_core" ) )
                || ( NoCore && strstr( foo, "NO_core" ) ) ) );
        }
        /* if preliminary match, check rest of .mf record - first # reactants */
        if (rxMatch) {
            /* skip name, record / compare number of reagents */
            if ( -1 == UTL_SCAN_GETS( MasterFile_File, "\\", &foo)) return 0;
            if ( -1 == UTL_SCAN_GETS( MasterFile_File, "\\", &foo)) return 0;
            if ( ! UTL_STR_ATOI(foo, &d) ) return 0;
            if (!nR) {
                if (SideChainOnly && d != 1) {
                    fprintf( stdout, "Side Chain only but .mf file references more than one side chain\n" );
                    return 0;
                }
                nR = d;
                if (!initXarrays()) return 0;
            }
            rxMatch = nR == d;
        }
        if (rxMatch) {
            /* skip fgpt stuff, record core and side chain file stuff */
            if ( -1 == UTL_SCAN_GETS( MasterFile_File, "\\", &foo)) return 0;
            if ( -1 == UTL_SCAN_GETS( MasterFile_File, "\\", &foo)) return 0;
            if ( -1 == UTL_SCAN_GETS( MasterFile_File, "\\", "", &foo)) return 0;
            if (Corefile) UTL_MEM_FREE( Corefile );
            Corefile = UTL_STR_SAVE(foo);
            if ( -1 == UTL_SCAN_GETS( MasterFile_File, "\\", &foo)) return 0;
            if ( ! UTL_STR_ATOI(foo, &StartCore ) ) return 0;
            if ( -1 == UTL_SCAN_GETS( MasterFile_File, "\\", &foo)) return 0;
            if ( -1 == UTL_SCAN_GETS( MasterFile_File, "\\", &foo)) return 0;
            for (i = 0; i < nR; i++) if (!initXfiles( i, &SideChainsAreSame ) ) return 0;
        }
    } /* read .mf file until we have a matching reaction */
    return 1;
}

static int ReadXs() {
    /* reads all topomer fields from all current Xn files */
    int R, F, i, n, ns, realloc, Fd;
    char *CTOPS, *line, *fptr;
    double *dp, **sdptr;
    unsigned char **uc;
    combi = 1.0;
    /* skip the following lengthy stuff if side chains are all the same */
    if (SideChainsAreSame && X[0]) return 1;
    for (R = 0; R < nR; R++) {
        if (! (InputSourceFile = fopen(Xfile[R],"r"))) return 0;
        n = CountLines();
        realloc = n != nX[R];
        combi *= (double) n;
        if (realloc && nX[R])
            for (F = 0; F < nFType; F++) {
                for (i = 0; i < nX[R]; i++) UTL_MEM_FREE( *(X[F] + R) + i );
                UTL_MEM_FREE( X[F] + R );
            }
        nX[ R ] = n;
        if (realloc) for (F = 0; F < nFType; F++)
            if (!(*(X[F] + R) = (unsigned char **)
                UTL_MEM_ALLOC( sizeof( unsigned char *) * nX[R]) )) return 0;
        /* starts reading at line 2! */
        for(i = 0; i < nX[R]; i++) {
            if (-1 == UTL_SCAN_GETS( InputSourceFile, "\\", &line))
                goto error;
            /* generate info for left-to-right read */
            for (F = 0; F < nFType; F++) FOrder[ F ] = strstr( line, FTypes[ F ] );
            do {
                for (Fd = -1, F = 0, fptr = 0; F < nFType; F++)
                    if (FOrder[F] && (!fptr || FOrder[F] < fptr)) {fptr = FOrder[F]; Fd =
                        F;}
                if (fptr) {
                    fptr += strlen( FTypes[ Fd ] ) + 1; /* skipping "CTOPS=" */
                    UTL_SCAN_TOKENIZE(fptr,';','\\');
                    UTL_SCAN_TOKENIZE(fptr,'>','\\');
                    if (!ReadAField( fptr, Xsize[ Fd ] + R, *(X[Fd] + R) + i )) goto error;
                    FOrder[ Fd ] = 0;
                }
            } while (fptr);
        }
        fclose( InputSourceFile );
        /* set up X - Y distance vectors */
        if (realloc) for (F = 0; F < nFType; F++) for (ns = 0; ns < nSym; ns++) {
            sdptr = X2Y[ns];
            if (sdptr[R]) UTL_MEM_FREE( sdptr[R] );
            if (!( sdptr[ R ] = (double *) UTL_MEM_ALLOC( sizeof( double ) * nX[R] ) ))
                return 0;
            for (i = 0, dp = sdptr[R]; i < nX[R]; i++) *dp++ = -1.0;
        }
    }
    return 1;
error:
    fprintf( stdout, "topsim failed reading line %d of %s\nLast line read was %s\n",
        i, Xfile[R], line );
    return 0;
}
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char **ParseQuotedString( SDetails, nDetail, Weights ) 
char *SDetails;
int *nDetail;
double **Weights;
{
    char *pch, **ppch, *wch, **Details;
    int i;
    double *wt;
    /* first trim string to remove leading/trailing spaces and quotes */
    while (*SDetails == '"' || *SDetails == ' ') SDetails++;
    pch = SDetails + strlen( SDetails ) - 1;
    while (*pch == '"' || *pch == ' ') *pch-- = '\0';
    /* each space is token delimiter */
    for (i = 0, pch = SDetails; *pch; pch++)
        if (*pch == ' ') i++;
    *nDetail = i+1;
    if (!(Details = (char **) UTL_MEM_ALLOC( sizeof( char * ) * (*nDetail) ) ))
        return 0;
    if (Weights) {
        if (!(*Weights = (double *) UTL_MEM_ALLOC( sizeof( double ) * (*nDetail) ) ))
            return 0;
        wt = *Weights;
    }
    pch = SDetails;
    if (*pch == '"') pch++;
    for (i = 0, ppch = Details; i < *nDetail; i++, ppch++) {
        UTL_SCAN_TOKENIZE(pch,' ','\\');
        *ppch = UTL_STR_SAVE( pch );
        if (Weights) {
            /* note, the copy is now being modified */
            if ((wch = strstr( *ppch, "=" ))) {
                if (!isweight( wch + 1 )) return FALSE;
                *wt = atof( wch + 1 );
                *wch = '\0';
            }
            else *wt = 1.0;
            wt++;
        }
        pch += strlen( pch ) + 1;
    }
    return( Details );
}

int isweight( s )
/* returns true if value is a positive decimal value */
char *s ;
{
    char *c;
    for (c = s; *c; c++) if (!isdigit( *c ) && ( *c != '.' )) {
        fprintf( stdout, "Bad weight value: %s. Aborting \n", s );
        return( FALSE );
    }
    return(TRUE);
}

int ParseRxn()
/* parses complex input descriptions */
{
    char **ParseQuotedString(), **scratch;
    int nRW, i, nX;
    double wtsum;
    /* parse field type information or set up standard (steric) type only */
    if (FieldTypes) {
        if (!(FTypes = ParseQuotedString( FieldTypes, &nFType, &FWeights ) )) return 0;
        /* scale to average weight of unity */
        for (i = 0, wtsum = 0.0; i < nFType; i++) wtsum += FWeights[i];
        wtsum /= (double) nFType;
        for (i = 0; i < nFType; i++) FWeights[ i ] /= wtsum;
    }
    else {
        nFType = 1;
        if (!( FTypes = (char **) UTL_MEM_ALLOC( sizeof( char * ) ) )) return 0;
        if (!( *FTypes = UTL_STR_SAVE( "CTOPS" ) )) return 0;
        if (!( FWeights = (double *) UTL_MEM_ALLOC( sizeof( double ) ) )) return 0;
        *FWeights = 1.0;
    }
    if (!(FOrder = (char **) UTL_MEM_ALLOC( sizeof(char *) * nFType ) )) return 0;
    /* parse any reaction type information present */
    nR = 0;
    if (SideChainOnly) {
        NoCore = TRUE;
        return 1;
    }
    if (!ReactionNeeded) return 0;
    if (ScratchDetails) {
        if (!(ReactionDetails = ParseQuotedString( ScratchDetails, &nDetail, NIL ) )) return 0;
        nR = nDetail;
        if (!initXarrays()) return 0;
    }
    if (!(FROrder = (char **) UTL_MEM_ALLOC( sizeof(char *) * nFType * nR ) )) return
        0;
    /* parse any user-provided variation-weighting */
    CoreWeight = 1.0;
    if (!( RWeights = (double *) UTL_MEM_ALLOC( sizeof( double ) * nR ) )) return 0;
    if (XWeights) {
        if (!(scratch = ParseQuotedString( XWeights, &nRW, NIL ) )) return 0;
        /* scratch will just be unfreed memory */
        nX = nR + (NoCore ? 0 : 1);
        if (nRW != nX) {
            fprintf( stdout, "Mismatch between count of xweights (%d) and needed (%d).\n", nRW, nX );
            return 0;
        }
        for (i = 0, wtsum = 0.0; i < nR; i++) if (!isweight( scratch[ i ] )) return
            FALSE;
            else {
                RWeights[ i ] = atof( scratch[ i ] );
                wtsum += RWeights[ i ];
            }
        if (!NoCore) if (!isweight( scratch[ nR ])) return FALSE;
            else {
                CoreWeight = atof( scratch[ nR ] );
                wtsum += CoreWeight;
            }
        wtsum /= (double) nX;
        for (i = 0; i < nR; i++ ) RWeights[ i ] /= wtsum;
        if (!NoCore) CoreWeight /= wtsum;
    }
    else for (i = 0; i < nR; i++) RWeights[ i ] = 1.0;
    return 1;
}
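
/* Illustrative sketch only, not part of the original listing. The routine
   above rescales user-supplied weights so that their average is unity; the
   same step can be written as a small standalone function. The name
   normalize_weights is hypothetical, and the zero-sum guard is an addition
   of this sketch rather than something the listing is known to do. */

```c
#include <assert.h>
#include <stddef.h>

/* Rescale the n weights in w, in place, so that their mean becomes 1.0.
   Returns 1 on success, 0 if n is zero or the weights sum to zero. */
int normalize_weights(double *w, size_t n)
{
    double mean = 0.0;
    size_t k;

    assert(w != NULL);
    if (n == 0)
        return 0;
    for (k = 0; k < n; k++)
        mean += w[k];
    mean /= (double) n;
    if (mean == 0.0)
        return 0;
    for (k = 0; k < n; k++)
        w[k] /= mean;
    return 1;
}
```

/* With this scaling, weights of 1, 2 and 3 become 0.5, 1.0 and 1.5, so the
   relative emphasis is kept while the overall threshold stays comparable
   to the unweighted case. */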

int ReadEverything()
{
    if (!MasterFile && !BitsetFile) return 0;
    if (!ParseRxn()) return 0;
    setbits_nbits_Init();
    if (!InitMasterFile() ) return 0;
    if (!InitCore() ) return 0;
    if (!WhatsTheDifference()) return 0;
    if (!RetrieveInput() ) return 0;
    return 1;
}

static int InitSym( nsym )
int nsym;
{
    /* sets up symmetries to consider as described for core
       ONLY 2 reactants considered for now!
       assumes that CoreNow is pointing to the appropriate structure */
    int i, F, maxsym;
    double **x2y;
    /* symmetry from current core molecule if not supplied by caller */
    nSym = nsym;
    if (!nSym) {
        if ((!strstr( CoreNow, "SYM=" )) || (strstr(CoreNow, "SYM=0")) ) nSym = 1;
        if (strstr(CoreNow, "SYM=1")) nSym = 2;
        /* add more categories here */
    }
    for (i = 0; i < nSym; i++) CoreSyms[ i ] = 1;
    /* allocate distance arrays to max possible for nR */
    if (!X2Y) {
        for (maxsym = 1, i = 0; i < nR; i++) maxsym *= (i+1);
        if (!(X2Y = (double ***) UTL_MEM_ALLOC( sizeof( double **) * nFType ) ))
            return 0;
        for (i = 0; i < maxsym; i++) {
            if (!(X2Y[i] = (double**) UTL_MEM_ALLOC( sizeof( double *) * nR) )) return 0;
            memset( X2Y[i], 0, sizeof( double *) * nR );
        }
    }
    return nSym;
}

int ReadCoreTopomers( CoreOK )
int *CoreOK;
{
    /* returns 1 unless fatal error. Sets CoreOK to TRUE if this mf entry is OK
       Also sets up symmetry considerations (which are core structure dependent),
       assumes that CoreNow is pointing to the appropriate structure */
    int foo, i, R, F, Fd, Rd, rf, skipcore, ns, *Sym;
    char label[15], *nxTop, *cstart, *fptr;
    char *cnames[] = {"NX=","NY=","NZ=","CX=","CY=","CZ="};
    double coo;
    double atof();
    skipcore = NoCore;
    /* always consider both matches iff no core */
    if (skipcore) InitSym( nR );
    else skipcore = ! InitSym(0);
    /* check for any symmetry-allowed rxn by rxn match of all reactant name "details" */
    for (ns = 0; ns < nSym; ns++) if (CoreSyms[ns]) {
        Sym = *(SymList + ns);
        *CoreOK = TRUE;
        if (!SideChainOnly)
            for (i = 0; i < nR && *CoreOK; i++)
                if (!strstr( ReactionDetails[ Sym[ i ] ], Xname[ i ] ))
                    *CoreOK = FALSE;
        if (*CoreOK) break;
    }
    if (skipcore || CoreIsSame || !(*CoreOK )) return 1;
    nxTop = CoreNow;
    /* read left-to-right, so record all starting points;
       assume that coords are bundled and appear only once
    */
    for (F = 0; F < nFType; F++) for (R = 0; R < nR; R++) {
        sprintf( label, "%s%d", FTypes[ F ], R + 1 );
        if (!( FROrder[ F * nR + R ] = strstr( nxTop, label ))) {
            /* some requested datum missing; then this core entry has no topomer data; use it */
            *CoreOK = 0;
            return 1;
        }
    }
    cstart = strstr( nxTop, cnames[ 0 ] );
    do {
        /* find next datum in left-to-right order */
        for (F = 0, fptr = 0; F < nFType; F++) for (R = 0; R < nR; R++) {
            rf = F * nR + R;
            if (FROrder[rf] && (!fptr || FROrder[rf] < fptr)) {fptr = FROrder[rf]; Fd = F;
                Rd = R;}
        }
        if (cstart && (!fptr || cstart < fptr)) {fptr = cstart; Fd = -1; }
        if (fptr) {
            /* unpack next piece of data to proper location */
            if (Fd >= 0) {
                /* then datum is a field */
                fptr += strlen( FTypes[ Fd ] ) + 2; /* skipping "CTOPn=" */
                UTL_SCAN_TOKENIZE(fptr,';','\\');
                UTL_SCAN_TOKENIZE(fptr,'>','\\');
                if (!ReadAField( fptr, Xsize[ Fd ] + Rd, Y[Fd] + Rd )) return 0;
                FROrder[ Fd * nR + Rd ] = 0;
            }
            else {
                for (i = 0; i < 6; i++) {
                    /* the next data are coordinates */
                    /* read coords, save as distances squared */
                    cstart = strstr( cstart, cnames[i]);
                    if (!cstart) {
                        /* then this core entry has no topomer data */
                        *CoreOK = 0;
                        return 1;
                    }
                    cstart += strlen(cnames[i]);
                    UTL_SCAN_TOKENIZE(cstart,';','\\');
                    UTL_SCAN_TOKENIZE(cstart,'>','\\');
                    coo = CXcoords[ i ] - atof(cstart);
                    CXdiffsq[ i ] = coo * coo * DWeight;
                    cstart += strlen( cstart ) + 1;
                }
                cstart = 0;
            }
        }
    } while (fptr);
    return 1;
}

int CoreMatches( CoreOK )
int *CoreOK;
{
    /* returns 1 unless fatal error. Sets CoreOK to FALSE if no compound having
       this core can possibly match */
    int F, R, i, ns, *Xct, ct;
    double sqrt(), totd, xount, cdiff;
    unsigned char *ptr, *qtr;
    if (NoCore || CoreIsSame) {
        *CoreOK = TRUE;
        return 1;
    }
    /* can check for coordinate discrepancy fast! */
    for (i = 0, cdiff = 0.0; i < 6; i++) cdiff += CXdiffsq[i];
    if (cdiff > Distance) {
        *CoreOK = FALSE;
        return 1;
    }
    for (F = 0, totd = cdiff; F < nFType; F++) for (R = 0; R < nR; R++) {
        if (totd > Distance) break;
        ptr = (unsigned char *) *(Y[ F ] + R);
        qtr = (unsigned char *) *(Yin[F] + R);
        if (!ptr || !qtr) xount = 999999.0;
        else for (xount=0.0, i=0; i < *(Xsize[F] + R); i++, ptr++, qtr++)
            xount += Dist[ *ptr & 0x0F ][ *qtr & 0x0F ]
                + Dist[ (*ptr & 0xF0) >> 4][ (*qtr & 0xF0) >> 4] ;
        totd += xount * FWeights[ F ] / (double) nR;
    }
    CoreDistance = totd * CoreWeight;
    *CoreOK = totd <= Distance;
    return 1;
}
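
/* Illustrative sketch only, not part of the original listing. FindXMatches,
   which follows, visits every combination of side chain indices by
   advancing an index tuple with the last position varying fastest. That
   stepping can be isolated as below; the name next_tuple is hypothetical,
   and every limit is assumed to be at least 1. */

```c
#include <assert.h>
#include <stddef.h>

/* Advance idx[0..n-1] to the next tuple, where idx[k] runs over
   0 .. limit[k]-1 and the last position varies fastest.
   Returns 1 if a new tuple was produced, or 0 once every tuple has been
   visited (idx is then back at all zeros). */
int next_tuple(int *idx, const int *limit, int n)
{
    int k;

    assert(idx != NULL && limit != NULL && n > 0);
    k = n - 1;
    while (k >= 0) {
        idx[k]++;
        if (idx[k] < limit[k])
            return 1;
        idx[k] = 0;     /* roll over and carry into the position to the left */
        k--;
    }
    return 0;
}
```

/* Starting from all zeros and calling next_tuple until it returns 0 visits
   each product of the limits exactly once. */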

int FindXMatches () {
    int R, F, i, ns, ct, *Sym, size, what ;
    double totd, d, **sdptr, *dptr, xount;
    unsigned char *ptr, *qtr;
    /* reinitialize indices for permuting over all products --
       code is general for any number of variable positions */
    for (i = 0; i < nR; i++) Xct[i] = 0;
    AddressSize(nR, nX, &size);
    size = (size + 31 )/32 * 4;
    if (bitset) /* assumes actual sizes matches current sizes! */
    {
        if (!(Dead_Products = (int *) UTL_MEM_ALLOC(size))) return 0;
        CS_PRDCT_BITSET_TO_RAW( bitset, Dead_Products, 0);
        not_here(Dead_Products,size );
    }
    while ( TRUE ) {
        /* exit elsewhere when all products are enumerated */
        IndexesToAddress( nR, nX, &what, Xct);
        if (Dead_Products &&
            TestDead(0, what) ) goto tupledone; /* not doing this one! */
        for (ns = 0; ns < nSym; ns++) if (CoreSyms[ns]) {
            /* process all symmetries of current side chain combo */
            Sym = *(SymList + ns);
            sdptr = *(X2Y + ns);
            for (R = 0, totd = CoreDistance; R < nR; R++) {
                if (totd > Distance) break;
                /* compute next distance if not already done - DEP knows how this works! */
                dptr = (*(sdptr + R ) + Xct[R]);
                if ((*dptr) < 0.0) for (F = 0; F < nFType; F++ ) {
                    ptr = (unsigned char *) *( *(X[F] + R) + Xct[ R ]);
                    qtr = (unsigned char *) *(Xin[ F ] + Sym[ R ]) ;
                    if (!ptr || !qtr) {*dptr = 999999.0; break;}
                    else {
                        for (xount=0.0, i=0; i < *(Xsize[F] + R); i++, ptr++, qtr++)
                            xount += Dist[ *ptr & 0x0F ][ *qtr & 0x0F ]
                                + Dist[ (*ptr & 0xF0) >> 4][ (*qtr & 0xF0) >> 4] ;
                        *dptr += xount * FWeights[ F ];
                    }
                }
                totd += *dptr * RWeights[ R ];
            }
            /* if hit, write it out */
            if (totd <= Distance) {
                if (NotBitOutput || nR != 2) {
                    /* ASCII index form of output - also REQUIRED if more than 2 varying elements */
                    if (!OutputFile && !OpenOutputFile() ) return 0;
                    for (R = 0; R < nR; R++) fprintf( OutputFile, "%6d ", Xct[R] + 1 );
                    fprintf( OutputFile, "%6d%8.2f%8.2f%8.2f\n", StartCore,
                        sqrt(totd), sqrt(CoreDistance), sqrt(totd - CoreDistance) );
                }
                else {
                    if (!Good_Products ) {
                        if (!(Good_Products = (int *) UTL_MEM_ALLOC( size ) )) return
                            0;
                        memset( Good_Products, 0, size );
                    }
                    FlagProduct(Good_Products, 0, 0, what );
                }
                nout++;
                if (NoMorehitsPlease && nout >= NoMorehitsPlease) goto done;
                /* output only one acceptable symmetry per product */
                goto tupledone;
            }
        }
        /* generate next index tuple, AKA candidate product */
tupledone:
        ct = nR - 1;
        while ( TRUE ) {
            Xct[ct]++;
            if (Xct[ ct ] < nX[ ct ]) break;
            /* finished when first index exceeds limit -- the other exit */
            if (ct == 0) goto done;
            Xct[ ct ] = 0;
            ct--;
        }
    }
done:
    /* output any products from this dataset */
    if (NotBitOutput || nR != 2) {
        if (OutputFile) fclose(OutputFile);
        OutputFile = 0;
    }
    else if (Good_Products) {
        WriteStdFile();
        UTL_MEM_FREE( Good_Products );
        Good_Products = (int*) 0;
    }
    return 1;
}

int MakeOutputFileName() {
    /* a run may produce multiple files, and the user probably can't tell,
       so append a sequence # to subsequent base names */
    if (!nOutFiles) {
        sprintf( OutputFileName, "%s", OutputFileBase );
        /* base name ready for next call */
        strtok( OutputFileBase, "." );
    }
    else sprintf( OutputFileName, "%s_%d.%s", OutputFileBase,
        nOutFiles, OutputFileBase + strlen(OutputFileBase) + 1 );
    nOutFiles++;
}

int WriteStdFile() {
    /* writes out the bit set of products */
    int sizes[2];
    int allocSizes[2] ;
    int numInSites[2] ;
    void *compressed ;
    int total ;

    sizes[0] = nX[0] ;
    sizes[1] = nX[1] ;
    numInSites[0] = numInSites[1] = -1 ;
    allocSizes[0] = allocSizes[1] = -1 ;
    compressed = NIL;
    total = 0;
    MakeOutputFileName();
    WriteOutCheckPointFile( OutputFileName,
        MasterFile,
        MasterRecord,
        comline,
        Good_Products,
        0,
        2,
        sizes,
        allocSizes,
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nout, 

numlnSites, 
total, 

compressed); 

5 } 
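
The inner loop of the search routine above packs two 4-bit field codes into each byte and sums one table lookup for the low nibble and one for the high nibble. The following is a minimal sketch of that idea only. The names `packed_distance` and `fill_squared_table` are hypothetical, and filling the 16 x 16 table with squared code differences is an assumption made for the example, not the contents of the listing's `Dist` table.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: sums table lookups over two nibble-packed byte strings. */
double packed_distance(const unsigned char *p, const unsigned char *q,
                       size_t nbytes, double dist[16][16])
{
    double sum = 0.0;
    size_t i;

    assert(p != NULL && q != NULL);
    for (i = 0; i < nbytes; i++) {
        sum += dist[p[i] & 0x0F][q[i] & 0x0F];               /* low nibbles  */
        sum += dist[(p[i] & 0xF0) >> 4][(q[i] & 0xF0) >> 4]; /* high nibbles */
    }
    return sum;
}

/* Assumed example table: squared difference of the two 4-bit codes. */
void fill_squared_table(double dist[16][16])
{
    int a, b;

    for (a = 0; a < 16; a++)
        for (b = 0; b < 16; b++)
            dist[a][b] = (double)((a - b) * (a - b));
}
```

A precomputed table keeps the per-byte work to two lookups and one addition, which matters when the loop runs once per candidate product.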

int ReadNextCore( SideChainsAreSame, CoreIsSame )
int *SideChainsAreSame;
int *CoreIsSame;
{
/* continues reading through master file for more matching Reaction Classes.
   If the side chain files have the same name, can skip rebuild of X diffs */
char *foo;
int i, d, rxMatch = 0, val;

if (AllCores) {
    if ( -1 == UTL_SCAN_GETS( CoreFile_File, "\\", &foo)) fclose(CoreFile_File);
    else {
        /* get next core ready and quit */
        CoreNow = UTL_STR_SAVE(foo);
        *SideChainsAreSame = TRUE;
        StartCore++;
        return 1;
    }
}

while ( !rxMatch ) {
    if ( -1 == UTL_SCAN_GETS( MasterFile_File, "\\", &foo)) return 0;
    /* preliminary match if (1) Reaction Needed matches and (2)
       NO_core must be present if NoCore is TRUE (or vice versa) */
    rxMatch = ( strstr(foo, "Reaction class ") && strstr(foo, ReactionNeeded)
        && ((!NoCore && !strstr( foo, "NO_core" ) )
        || ( NoCore && strstr( foo, "NO_core" ) ) ) );
    if (feof(MasterFile_File)) return 0;
    /* skip name, record number of reagents */
    if (rxMatch) {
        if ( -1 == UTL_SCAN_GETS( MasterFile_File, "\\", &foo)) return 0;
        if ( -1 == UTL_SCAN_GETS( MasterFile_File, "\\", &foo)) return 0;
        if ( !UTL_STR_ATOI(foo, &val) ) return 0;
        if (val != nR) rxMatch = 0;
    }
    if (rxMatch) {
        /* skip next stuff, record core and side chain file stuff */
        if ( -1 == UTL_SCAN_GETS( MasterFile_File, "\\", &foo)) return 0;
        if ( -1 == UTL_SCAN_GETS( MasterFile_File, "\\", &foo)) return 0;
        if ( -1 == UTL_SCAN_GETS( MasterFile_File, "\\", &foo)) return 0;
        *CoreIsSame = TRUE;
        if (strcmp( foo, Corefile )) {
            *CoreIsSame = FALSE;
            UTL_MEM_FREE( Corefile );
            Corefile = UTL_STR_SAVE(foo);
        }
        if ( -1 == UTL_SCAN_GETS( MasterFile_File, "\\", &foo)) return 0;
        if ( !UTL_STR_ATOI(foo, &val ) ) return 0;
        if (val != StartCore ) *CoreIsSame = FALSE;
        StartCore = val;
        if (! *CoreIsSame ) {
            if (CoreFile_File) fclose(CoreFile_File);
            if (! (CoreFile_File = fopen(Corefile, "r"))) return 0;
            i = 0;
            while ( i < StartCore ) {
                if ( -1 == UTL_SCAN_GETS( InputSourceFile, "\\", &foo)) return 0;
                if (AllCores) break;
                i++;    /* advance to the next core record */
            }
            CoreNow = UTL_STR_SAVE( foo );
        }
        if ( -1 == UTL_SCAN_GETS( MasterFile_File, "\\", &foo)) return 0;
        if ( -1 == UTL_SCAN_GETS( MasterFile_File, "\\", &foo)) return 0;
        *SideChainsAreSame = TRUE;
        for (i = 0; i < nR; i++) if (!initXfiles( i, SideChainsAreSame ) ) return 0;
    }
}

return 1;
}

/* this belongs in the utl module, actually */
int MakeComLine( char *line, int len, int argc, char **argv)
{
int i, nch, totch = 0;

sprintf(line, "%s ", argv[0]);
for (i = 1; i < argc && totch < len; i++)
{
    nch = strlen(line);
    line += nch;
    totch += nch;
    if (totch < len ) sprintf(line, "%s ", argv[i]);
}
}
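
The routine above only checks the running length before each `sprintf`, so a long final argument can still run past the buffer. A bounded variant can be sketched with `snprintf`. The function name `join_args` and its return convention (the count of arguments copied in full) are assumptions for illustration and are not part of the original program.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical bounded variant: joins argv into buf (capacity len bytes,
   terminator included) and returns how many arguments were copied in full. */
int join_args(char *buf, size_t len, int argc, char **argv)
{
    size_t used = 0;
    int i;

    assert(buf != NULL && len > 0);
    buf[0] = '\0';
    for (i = 0; i < argc; i++) {
        int n = snprintf(buf + used, len - used, "%s ", argv[i]);
        if (n < 0 || (size_t)n >= len - used) {
            buf[used] = '\0';   /* drop the partially written argument */
            break;
        }
        used += (size_t)n;
    }
    return i;
}
```

A caller could compare the return value with `argc` to detect a truncated command line.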

int CheckPointProgram(void) {
fprintf(stderr, "CheckPointProgram() is a lonely stub in topsim.c!\n");
}

int main( argc, argv )
int argc;
char **argv;
{
int processing;

if ( !ParseArguments( argc, argv ) )
    goto SyntaxError;
MakeComLine( comline, 2048, argc, argv );
if (!ReadEverything()) goto FailureExit;
processing = 1;
while (processing) {
    if (!ReadCoreTopomers( &CoreOK )) goto FailureExit;
    if (CoreOK && !CoreMatches( &CoreOK )) goto FailureExit;
    if (CoreOK && !ReadXs()) goto FailureExit;
    searched += combi;
    if (CoreOK && !FindXMatches()) goto FailureExit;
    totnout += nout;
    nout = 0;
    processing = ReadNextCore( &SideChainsAreSame, &CoreIsSame ) &&
        (!NoMorehitsPlease || nout < NoMorehitsPlease);
}

fprintf(stdout, "Normal Exit: %d of %d are neighbors\n", totnout, searched );
UserAborted ? exit(ErrorExit) : exit(GoodExit);
SyntaxError:
exit(1);
FailureExit:
exit(ErrorExit);
}

/*
numVariations is number of dimensions Y_01, Y_02 etc (normally 2)
dsize contains the nY_01, nY_02 etc
address is the bit number (0 to N-1)
choices will contain the offsets (0 based) of Y_01, Y_02 etc, on return
*/

int AddressToIndexes(int numVariations, int *allPtr, int address, int *chPtr )
{
for ( chPtr += (numVariations - 1), allPtr += (numVariations - 1);
      numVariations--;
      allPtr--, chPtr--)
{
    *chPtr = address % *allPtr;
    address = address / *allPtr;
}
return 1;
}

int IndexesToAddress(int numVariations, int *allPtr, int *address, int *ind)
{
int i;
int indx = 0;

/* mixed-radix accumulation, so that AddressToIndexes inverts this mapping */
for (i = 0; i < numVariations; i++)
    indx = indx * allPtr[i] + ind[i];
*address = indx;
return 1;
}

int AddressSize(int numVariations, int *allPtr, int *size)
{
/* total number of products = product of the sizes of all variable positions */
for ( *size = 1; numVariations--; allPtr++) *size *= *allPtr;
return 1;
}
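
The three address routines above treat a product as a mixed-radix number: each variable position contributes one digit whose base is the number of reagents at that position, and the last position varies fastest. A self-contained sketch of that mapping, together with the odometer-style step used to enumerate candidate products, might look as follows. All function names here are hypothetical and chosen only for the example.

```c
#include <assert.h>

/* sizes[k] is the number of choices at position k; idx[k] is the 0-based choice. */
int tuple_to_address(int n, const int *sizes, const int *idx)
{
    int k, addr = 0;

    for (k = 0; k < n; k++) {
        assert(idx[k] >= 0 && idx[k] < sizes[k]);
        addr = addr * sizes[k] + idx[k];
    }
    return addr;
}

/* Inverse mapping: peel digits off from the last position backwards. */
void address_to_tuple(int n, const int *sizes, int addr, int *idx)
{
    int k;

    for (k = n - 1; k >= 0; k--) {
        idx[k] = addr % sizes[k];
        addr /= sizes[k];
    }
}

/* Odometer step: advance idx to the next tuple; returns 0 after the last one. */
int next_tuple(int n, const int *sizes, int *idx)
{
    int k = n - 1;

    while (k >= 0) {
        if (++idx[k] < sizes[k]) return 1;
        idx[k] = 0;     /* this digit rolls over, carry into the previous one */
        k--;
    }
    return 0;
}
```

Because the odometer visits tuples in address order, a bit set indexed by address can be filled or tested without storing the tuples themselves.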

int not_here( what, nbytes )
unsigned char *what;
int nbytes;
{
/* complement every byte, so that set bits mark products that are NOT present */
for ( ; nbytes; --nbytes, what++) *what = ~*what;
return 1;
}
Appendix "T"

@macro FragCTOPS ChSp
#
# Entry point for Web-based topomeric search initialization
#
# sets up a set of topomeric searches, by identifying topomer data arising from
# substructural searching of SLN patterns found in topfrag.tbl to the
# query structure and generating the topomeric data and search command file entry
# for all resulting fragmentations of the query structure.
#
# The Query SLN(s) are assumed to be referenced by $CS_QUERY;
# The file(s) to be searched are referenced by $CS_DATASET (space separated)
# The directory where command files are to be written is $CS_TEMPDIR
# The GUI parameters are to be in $CS_PARAMETERS
# The name of the output file(s) is to be in $CS_OUTPUT

# read in the data
globalvar CTOP
globalvar ACD!TopInited

localvar fcmdn fcmd tdn dist t base mf mfo nln nxid im ferrn rxids doit
# check the input parameters
setvar ferrn %cat( $CS_TEMPDIR "/CSerror.log" )
setvar ferr %open( $ferrn "w" )
setvar flogn %cat( $CS_TEMPDIR "/topfrag.log" )
setvar flog %open( $flogn "w" )
setvar fcmdn %cat( $CS_TEMPDIR "/CSCommands.cmd" )
setvar fcmd %open( $fcmdn "w" )
if %not( $fcmd )
  %write( $ferr could not open temp file $fcmdn to write ChemSpace search cmds. Quitting ) >$nulldev
  return
endif

for tsln in $CS_QUERY
  if %pos( "." $tsln )
    setvar nogood TRUE
    if %pos( "<" $tsln )
      if %gt( %pos( "." $tsln ) %pos( "<" $tsln ) )
        setvar nogood
      endif
    endif
    if $nogood
      %write( $ferr Topomeric searches require a monomolecular search target. Quitting ) >$nulldev
      goto error
    endif
  endif

  %write( $flog QUERY: $tsln ) >$nulldev

  setvar dist %CS_param_parse( distance $CS_PARAMETERS 9L0 )
  if %not( $dist )
    %write( $ferr No topomeric distance provided. Quitting ) >$nulldev
    goto error
  endif

  setvar priority %CS_param_parse( priority $CS_PARAMETERS 3.0 )
  if %not( $priority )
    %write( $ferr No reaction priority provided. Quitting ) >$nulldev
    goto error
  endif

  %write( $flog Fragment Priority: $priority ) >$nulldev
  setvar CTOP[ ONLY1 ] %CS_param_parse( only_subs $CS_PARAMETERS )
  if $CTOP[ ONLY1 ]
    %write( $flog Matching Side Chain Only ) >$nulldev
  endif
  setvar CTOP[ WEIGHTS ] %CS_param_parse( xweights $CS_PARAMETERS )
  if $CTOP[ WEIGHTS ]
    %write( $flog User Specified Weighting as: $CTOP[ WEIGHTS ] ) >$nulldev
    for w in $CTOP[ WEIGHTS ]
      setvar pats %search2d( $tsln %arg( 1 %set_unpack( $w ) ) NoDup 0 y )
      if %not( $pats )
        %write( $ferr Weighted search for fragment %arg( 1 %set_unpack( $w ) ) not in $tsln - can't happen! ) >$nulldev
        goto error
      else
        if %gt( %count( $pats ) 1 )
          %write( $flog NOTE: Multiple hits for weighting fragment %arg( 1 %set_unpack( $w ) ) in $tsln ) >$nulldev
        endif
      endif
    endfor
  endif

  setvar CTOP[ CHBD ] %CS_param_parse( hbonding $CS_PARAMETERS )
  if $CTOP[ CHBD ]
    %write( $flog FIELDS include Hydrogen Bonding with weight of $CTOP[ CHBD ] ) >$nulldev
  endif

  zap m1 >$nulldev
  %sln_to_mol( m1 $tsln ) >$nulldev
  if %molempty( m1 )
    %write( $ferr SYBYL cannot handle search target (SLN is: $tsln ). Quitting ) >$nulldev
    goto error
  endif

  setvar t %mol_info( m1 NATOMS )
  FILLVALENCE M1(*) H 1.0 1.5 1.0 1.5 >$nulldev

  if $CTOP[ ONLY1 ]
    if %neq( %mol_info( m1 NATOMS ) %math( $t + 1 ) )
      %write( $ferr Side chain search but target $tsln has other than one unfilled valence ) >$nulldev
      goto error
    endif
  else
    if %neq( %mol_info( m1 NATOMS ) $t )
      %write( $ferr Search Target $tsln has unfilled valences. Quitting ) >$nulldev
      goto error
    endif
  endif

  if $CTOP[ ONLY1 ]
    # only one side chain to model is a special case
    CTOP!SideChainOnly $fcmd $ferr $flog $dist
  else

    # check for custom topomer fragmentation table or selection
    setvar tftabn
    setvar tfrows

    if $CS_TOPFRAG
      setvar t %pos( V $CS_TOPFRAG )
      if %not( $t )
        %write( $ferr Custom table name $CS_TOPFRAG missing an ) >$nulldev
        goto error
      else
        setvar tftabn %substr( $CS_TOPFRAG 1 %math( $t - 1 ) )
        setvar tfrows %substr( $CS_TOPFRAG %math( $t + 1 ) )
      endif
    endif

    if %set_and( "%set_create( %table_name() )" TOPFRAG )
      table close TOPFRAG
    endif
    if %not( $tftabn )
      setvar tftabn %cat( $DSERV_TB topfrag.tbl )
    endif

    table recall $tftabn >$nulldev
    if %not( %set_and( "%set_create( %table_name() )" TOPFRAG ) )
      %write( $ferr $tftabn not found. Quitting ) >$nulldev
      goto error
    endif

    %write( $flog Topomer fragmentation table is %cat( $DSERV_TB topfrag.tbl ) ) >$nulldev

    # initialize random file name sequence generator
    setvar t %time()
    setvar base %rand( %substr( "$t" %math( %strlen( "$t" ) - 6 ) 2 ) )
    TAILOR SET MAXIMIN2 MAXIMUM_ITERATIONS 1000 ||
    %write( $flog Master file(s): $CS_DATASET ) >$nulldev
    %write( $flog TOPFRAG table: $tftabn - Row selection: $tfrows ) >$nulldev

    if %not( $tfrows )
      setvar tfrows %set_create( %range( 1 %table_attribute( NROWS ) ) )
    endif

    for rxid in %set_unpack( $tfrows )
      # processing ...
      %write( $flog - - - - ) >$nulldev
      # check priority
      TABLE Default TOPFRAG
      if %gt( %cell( $rxid PRIORITY ) $priority )
        %write( $flog TOPFRAG entry $rxid priority > $priority. ) >$nulldev
        break
      endif

      setvar CTOP[RxnCount][$rxid] 0
      if %CS_ReactantMatch( $rxid $fcmd $ferr $tsln $flog )
        %write( $flog >>> Topomer search queueing (TOPFRAG row $rxid) ) >$nulldev
        CS!Queue_Search $fcmd $rxid $dist $flog
      endif
    endfor
  endif
endfor

# may need to purge or rename error file here!
%close( $fcmd )
%close( $ferr )
%close( $flog )
return
error:
%close( $fcmd )
# ensure nothing in search command file !
%file_delete( $fcmdn ) >$nulldev
CLAIMS

What is claimed is: 

1. A computer-based method for selecting, for all possible product molecules which could
be created in a combinatorial synthesis from specified reactant molecules and common core
molecule, a subset of product molecules, comprising the following steps:

a. Characterizing all the reactant molecules with a validated molecular structural
descriptor appropriate to reactant molecules;

b. Hierarchically clustering the characterized reactant molecules until the intercluster
distance corresponds to the neighborhood distance of the validated molecular
structural descriptor or to a value close to the neighborhood distance which creates
a logical clustering break;

c. Selecting a reactant molecule from each cluster;

d. Combinatorially assembling the selected reactant molecules and core molecule into
products which would be created in the chemical synthesis;

e. Selecting a product molecule for inclusion in the subset;

f. Using a validated molecular structural descriptor appropriate to whole molecules,
calculating the descriptor distance between all selected product molecules and all
other product molecules;

g. Determining the shortest distance between each product molecule and all product
molecules previously selected;

h. Selecting for inclusion in the subset the product molecule whose shortest descriptor
distance from the previously selected molecules is the largest and is greater than the
neighborhood distance of the descriptor;

i. Repeat steps f through h until the largest shortest difference between molecules is less
than the neighborhood distance of the descriptor; and

j. Outputting a list of the selected product molecules and/or the reactant molecules from
which the selected product molecules can be formed.
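
Steps e through i of claim 1 describe a max-min selection: each new product is the one farthest from everything already chosen, and selection stops once no remaining product lies outside the neighborhood distance. A minimal sketch under stated assumptions follows. The function name `maxmin_select`, the dense row-major distance matrix, and the caller-supplied starting product `seed` are illustrative choices, not details taken from the claim.

```c
#include <assert.h>
#include <stdlib.h>

/* d is an n*n row-major matrix of descriptor distances (d[i*n+i] == 0).
   chosen[] must hold n ints; returns the number of products selected. */
int maxmin_select(int n, const double *d, double neighborhood, int seed, int *chosen)
{
    double *nearest;    /* shortest distance from each product to the selected set */
    int count = 0, i;

    assert(n > 0 && seed >= 0 && seed < n && neighborhood >= 0.0);
    nearest = malloc((size_t)n * sizeof *nearest);
    if (nearest == NULL) return 0;

    for (i = 0; i < n; i++) nearest[i] = d[seed * n + i];
    chosen[count++] = seed;                         /* step e */

    for (;;) {
        int best = 0;
        for (i = 1; i < n; i++)                     /* step h: largest shortest distance */
            if (nearest[i] > nearest[best]) best = i;
        if (nearest[best] <= neighborhood) break;   /* step i: stopping rule */
        chosen[count++] = best;
        for (i = 0; i < n; i++)                     /* steps f and g: refresh shortest distances */
            if (d[best * n + i] < nearest[i]) nearest[i] = d[best * n + i];
    }
    free(nearest);
    return count;
}
```

Keeping one running "shortest distance" per product means each round costs a single pass over the products rather than a full comparison against every earlier selection.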

2. The method of claim 1 in which the validated molecular structural descriptor appropriate
to reactant molecules is topomeric CoMFA fields.

3. The method of claim 2 in which topomeric hydrogen bond fields are used in conjunction 
with the topomeric CoMFA fields descriptor. 

4. The method of claim 2 in which the validated molecular structural descriptor appropriate 
to whole molecules is the Tanimoto 2D coefficient. 
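
For readers unfamiliar with the Tanimoto coefficient named in this claim, it is the number of fingerprint bits set in both molecules divided by the number set in either. The sketch below assumes byte-array fingerprints of equal length and takes the descriptor distance to be one minus the coefficient; both the layout and that distance convention are assumptions for illustration, as are the function names.

```c
#include <assert.h>
#include <stddef.h>

/* Counts the set bits in one byte. */
int popcount8(unsigned char b)
{
    int c = 0;

    while (b) { c += b & 1; b >>= 1; }
    return c;
}

/* Tanimoto coefficient of two fingerprints of nbytes each.
   Two empty fingerprints are treated as identical (assumed convention). */
double tanimoto(const unsigned char *a, const unsigned char *b, size_t nbytes)
{
    int both = 0, either = 0;
    size_t i;

    assert(a != NULL && b != NULL);
    for (i = 0; i < nbytes; i++) {
        both   += popcount8((unsigned char)(a[i] & b[i]));
        either += popcount8((unsigned char)(a[i] | b[i]));
    }
    return either ? (double)both / either : 1.0;
}

/* Assumed distance form: 0 for identical fingerprints, 1 for disjoint ones. */
double tanimoto_distance(const unsigned char *a, const unsigned char *b, size_t nbytes)
{
    return 1.0 - tanimoto(a, b, nbytes);
}
```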
5. The method of claim 4 in which before step g, reactant molecules with the following
characteristics are removed from further use in the method:

a. toxic reactant molecules;

b. reactant molecules containing metals, improper forms of tautomers, and interfering
chemical groups;

c. reactant molecules with too low a bioavailability;

d. reactant molecules not likely to cross membranes; and

e. reactant molecules containing biologically non-relevant groups.

6. The method of claim 5 in which before step [illegible], product molecules with the following
characteristics are removed from further use in the method:

a. product molecules having MW > 750; and

b. product molecules not having a CLOGP between -2 and 7.5.
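
The product filter in this claim reduces to two range checks. A minimal sketch, assuming the molecular weight and CLOGP values have already been computed elsewhere, that the weight limit means "greater than 750", and that the CLOGP bounds are inclusive. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record holding precomputed properties of one product. */
struct product {
    double mw;      /* molecular weight */
    double clogp;   /* calculated log P */
};

/* Returns 1 if the product stays in the pool, 0 if it is removed. */
int keep_product(const struct product *p)
{
    assert(p != NULL);
    if (p->mw > 750.0) return 0;                        /* too heavy */
    if (p->clogp < -2.0 || p->clogp > 7.5) return 0;    /* CLOGP outside -2..7.5 */
    return 1;
}
```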

7. The method of claim 1 in which the validated molecular structural descriptor appropriate 
to whole molecules is the Tanimoto 2D coefficient. 

8. The method of claim 7 in which before step a, reactant molecules with the following
characteristics are removed from further use in the method:

a. toxic reactant molecules;

b. reactant molecules containing metals, improper forms of tautomers, and interfering
chemical groups;

c. reactant molecules with too low a bioavailability;

d. reactant molecules not likely to cross membranes; and

e. reactant molecules containing biologically non-relevant groups.

9. The method of claim 8 in which before step g, product molecules with the following
characteristics are removed from further use in the method:

a. product molecules having MW > 750; and

b. product molecules not having a CLOGP between -2 and 7.5.

10. A computer-based method for selecting, for all possible product molecules which
could be created in a combinatorial synthesis from specified reactant molecules, a subset of
product molecules, comprising the following steps:

a. Characterizing all the reactant molecules with a validated molecular structural
descriptor appropriate to reactant molecules;

b. Hierarchically clustering the characterized reactant molecules until the intercluster
distance corresponds to the neighborhood distance of the validated molecular
structural descriptor or to a value close to the neighborhood distance which creates
a logical clustering break;

c. Selecting a reactant molecule from each cluster;

d. Combinatorially assembling the selected reactant molecules and core molecule into
products which would be created in the chemical synthesis;

e. Selecting a product molecule for inclusion in the subset;

f. Using a validated molecular structural descriptor appropriate to whole molecules,
calculating the descriptor distance between all selected product molecules and all
other product molecules;

g. Determining the shortest distance between each product molecule and all product
molecules previously selected;

h. Selecting for inclusion in the subset the product molecule whose shortest descriptor
distance from the previously selected molecules is the largest and is greater than the
neighborhood distance of the descriptor;

i. Repeat steps f through h until the largest shortest difference between molecules is less
than the neighborhood distance of the descriptor; and

j. Outputting a list of the selected product molecules and/or the reactant molecules from
which the selected product molecules can be formed.
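
Steps b and c of this claim cluster the reactants hierarchically and stop merging once clusters are about a neighborhood distance apart, after which one reactant is taken from each cluster. The sketch below illustrates the stopping rule with a deliberately simple agglomerative loop. Complete linkage (cluster distance taken as the largest member-to-member distance) is an assumption, since the claim does not name a linkage, and `cluster_until` is a hypothetical name. The loop is written for clarity, not speed.

```c
#include <assert.h>

/* d: n*n row-major symmetric distance matrix with non-negative entries.
   label[i] receives a cluster id for reactant i; returns the cluster count.
   Merging stops when the closest pair of clusters is farther apart than stop. */
int cluster_until(int n, const double *d, double stop, int *label)
{
    int i, j, a, b, nclusters = n;

    assert(n > 0);
    for (i = 0; i < n; i++) label[i] = i;       /* every reactant starts alone */

    for (;;) {
        int best_a = -1, best_b = -1;
        double best = 0.0;

        for (a = 0; a < n; a++) {
            for (b = a + 1; b < n; b++) {
                double link = -1.0;             /* stays -1 if either id is unused */
                for (i = 0; i < n; i++)
                    for (j = 0; j < n; j++)
                        if (label[i] == a && label[j] == b && d[i * n + j] > link)
                            link = d[i * n + j];
                if (link < 0.0) continue;
                if (best_a < 0 || link < best) { best = link; best_a = a; best_b = b; }
            }
        }
        if (best_a < 0 || best > stop) break;   /* no pair left within the limit */
        for (i = 0; i < n; i++)
            if (label[i] == best_b) label[i] = best_a;
        nclusters--;
    }
    return nclusters;
}
```

Picking one member per distinct label afterwards gives the reactants that feed the combinatorial assembly of step d.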

11. The method of claim 10 in which the validated molecular structural descriptor
appropriate to reactant molecules is topomeric CoMFA fields.

12. The method of claim 11 in which topomeric hydrogen bond fields are used in
conjunction with the topomeric CoMFA fields descriptor.

13. The method of claim 11 in which the validated molecular structural descriptor
appropriate to whole molecules is the Tanimoto 2D coefficient.

14. The method of claim 13 in which before step a, reactant molecules with the following
characteristics are removed from further use in the method:

a. toxic reactant molecules;

b. reactant molecules containing metals, improper forms of tautomers, and interfering
chemical groups;

c. reactant molecules with too low a bioavailability;

d. reactant molecules not likely to cross membranes; and

e. reactant molecules containing biologically non-relevant groups.

15. The method of claim 14 in which before step e, product molecules with the following
characteristics are removed from further use in the method:

a. product molecules having MW > 750; and

b. product molecules not having a CLOGP between -2 and 7.5.

16. The method of claim 10 in which the validated molecular structural descriptor
appropriate to whole molecules is the Tanimoto 2D coefficient.

17. The method of claim 16 in which before step a, reactant molecules with the following
characteristics are removed from further use in the method:

a. toxic reactant molecules;

b. reactant molecules containing metals, improper forms of tautomers, and interfering
chemical groups;

c. reactant molecules with too low a bioavailability;

d. reactant molecules not likely to cross membranes; and

e. reactant molecules containing biologically non-relevant groups.

18. The method of claim 17 in which before step e, product molecules with the following
characteristics are removed from further use in the method:

a. product molecules having MW > 750; and

b. product molecules not having a CLOGP between -2 and 7.5.

19. A system for selecting, for all possible product molecules which can be created in a
combinatorial synthesis from all specified reactant molecules and common core molecule, a
subset of product molecules whose members collectively represent most of the molecular
structural diversity in the possible combinatorially synthesized product molecules, comprising:

a. Means for characterizing all the reactant molecules with a validated molecular
structural descriptor appropriate to reactant molecules;

b. Means for hierarchically clustering the characterized reactant molecules until the
intercluster distance corresponds to the neighborhood distance of the validated
molecular structural descriptor or to a value close to the neighborhood distance which
creates a logical clustering break;

c. Means for selecting one reactant molecule from each cluster;

d. Means for combinatorially assembling the selected reactant molecules and core
molecule into products which would be created in the chemical synthesis;

e. Means for selecting at least one product molecule for inclusion in the subset;

f. Means for using a validated molecular structural descriptor applicable to whole
molecules for calculating the descriptor distance between all selected product
molecules and all other product molecules;

g. Means for determining the shortest distance between each product molecule and all
product molecules previously selected;

h. Means for selecting for inclusion in the subset the product molecule whose shortest
descriptor distance from the previously selected molecules is the largest and is greater
than the neighborhood distance of the descriptor;

i. Means for invoking means f through h until the largest shortest difference between
molecules is less than the neighborhood distance of the descriptor; and

j. Means for outputting a list of the selected product molecules and/or the reactant
molecules from which the selected product molecules can be formed.

20. The system of claim 19 in which the reactant appropriate molecular structural
descriptor is topomeric CoMFA fields.

21. The system of claim 20 in which topomeric hydrogen bond fields are used in
conjunction with the topomeric CoMFA fields descriptor.

22. The system of claim 20 in which the whole molecule appropriate molecular structural
descriptor is the Tanimoto 2D coefficient.

23. A system for selecting, for all possible product molecules which can be created in a
combinatorial synthesis from all specified reactant molecules, a subset of product molecules
whose members collectively represent most of the molecular structural diversity in the possible
combinatorially synthesized product molecules, comprising:

a. Means for characterizing all the reactant molecules with a validated molecular
structural descriptor appropriate to reactant molecules;

b. Means for hierarchically clustering the characterized reactant molecules until the
intercluster distance corresponds to the neighborhood distance of the validated
molecular structural descriptor or to a value close to the neighborhood distance which
creates a logical clustering break;

c. Means for selecting one reactant molecule from each cluster;

d. Means for combinatorially assembling the selected reactant molecules into products
which would be created in the chemical synthesis;

e. Means for selecting at least one product molecule for inclusion in the subset;

f. Means for using a validated molecular structural descriptor applicable to whole
molecules for calculating the descriptor distance between all selected product
molecules and all other product molecules;

g. Means for determining the shortest distance between each product molecule and all
product molecules previously selected;

h. Means for selecting for inclusion in the subset the product molecule whose shortest
descriptor distance from the previously selected molecules is the largest and is greater
than the neighborhood distance of the descriptor;

i. Means for invoking means f through h until the largest shortest difference between
molecules is less than the neighborhood distance of the descriptor; and

j. Means for outputting a list of the selected product molecules and/or the reactant
molecules from which the selected product molecules can be formed.

24. The system of claim 23 in which the reactant appropriate molecular structural
descriptor is topomeric CoMFA fields.

25. The system of claim 24 in which topomeric hydrogen bond fields are used in
conjunction with the topomeric CoMFA fields descriptor.

26. The system of claim 24 in which the whole molecule appropriate molecular structural
descriptor is the Tanimoto 2D coefficient.

27. A combinatorial screening library designed by a computer-based method, which
selects the screening library molecules from those molecules which could be created in a
combinatorial synthesis from specified reactant molecules and common core molecule,
comprising the following steps:

a. Characterizing all the reactant molecules with a validated molecular structural
descriptor appropriate to reactant molecules;

b. Hierarchically clustering the characterized reactant molecules until the intercluster
distance corresponds to the neighborhood distance of the validated molecular
structural descriptor or to a value close to the neighborhood distance which creates
a logical clustering break;

c. Selecting a reactant molecule from each cluster;

d. Combinatorially assembling the selected reactant molecules and core molecule into
products which would be created in the chemical synthesis;

e. Selecting a product molecule for inclusion in the subset;

f. Using a validated molecular structural descriptor appropriate to whole molecules,
calculating the descriptor distance between all selected product molecules and all
other product molecules;

g. Determining the shortest distance between each product molecule and all product
molecules previously selected;

h. Selecting for inclusion in the subset the product molecule whose shortest descriptor
distance from the previously selected molecules is the largest and is greater than the
neighborhood distance of the descriptor;

i. Repeat steps f through h until the largest shortest difference between molecules is less
than the neighborhood distance of the descriptor; and

j. Outputting a list of the selected product molecules and/or the reactant molecules from
which the selected product molecules can be formed.

28. The method of claim 27 in which the validated molecular structural descriptor
appropriate to reactant molecules is topomeric CoMFA fields.

29. The method of claim 28 in which topomeric hydrogen bond fields are used in
conjunction with the topomeric CoMFA fields descriptor.

30. The method of claim 28 in which the validated molecular structural descriptor
appropriate to whole molecules is the Tanimoto 2D coefficient.

31. A combinatorial screening library designed by a computer-based method, which
selects the screening library molecules from those molecules which could be created in a
combinatorial synthesis from specified reactant molecules, comprising the following steps:

a. Characterizing all the reactant molecules with a validated molecular structural
descriptor appropriate to reactant molecules;

b. Hierarchically clustering the characterized reactant molecules until the intercluster
distance corresponds to the neighborhood distance of the validated molecular
structural descriptor or to a value close to the neighborhood distance which creates
a logical clustering break;

c. Selecting a reactant molecule from each cluster;

d. Combinatorially assembling the selected reactant molecules and core molecule into
products which would be created in the chemical synthesis;

e. Selecting a product molecule for inclusion in the subset;

f. Using a validated molecular structural descriptor appropriate to whole molecules,
calculating the descriptor distance between all selected product molecules and all
other product molecules;

g. Determining the shortest distance between each product molecule and all product
molecules previously selected;

h. Selecting for inclusion in the subset the product molecule whose shortest descriptor
distance from the previously selected molecules is the largest and is greater than the
neighborhood distance of the descriptor;

i. Repeat steps f through h until the largest shortest difference between molecules is less
than the neighborhood distance of the descriptor; and

j. Outputting a list of the selected product molecules and/or the reactant molecules from
which the selected product molecules can be formed.

32. The method of claim 31 in which the validated molecular structural descriptor
appropriate to reactant molecules is topomeric CoMFA fields.

33. The method of claim 32 in which topomeric hydrogen bond fields are used in
conjunction with the topomeric CoMFA fields descriptor.

34. The method of claim 32 in which the validated molecular structural descriptor
appropriate to whole molecules is the Tanimoto 2D coefficient.

35. A computer-based method for characterizing the relative validity or usefulness of
molecular structural descriptors using multiple literature data sets containing a variety of
chemical structures and associated activities comprising the following steps:

a. Applying the molecular structural descriptors to all compounds represented in each
data set to derive descriptor values;

b. Constructing a Patterson plot for each molecular structural descriptor for each data
set using the descriptor values for the compounds in each data set and their associated
activities;

c. Determining the appropriate Patterson plot line and the corresponding density ratio
for each molecular structural descriptor for each data set;

d. Determining the number of data sets for each molecular structural descriptor for
which the Patterson plots have a density ratio greater than a predetermined cut-off
value; and

e. Creating a ranking ratio for each molecular structural descriptor in which the
numerator is the number determined in step d and the denominator is the number of
data sets, said ranking ratio for each molecular structural descriptor being
representative of the relative validity or usefulness of each molecular structural
descriptor wherein higher values of the ranking ratio represent a higher degree of
validity/usefulness.

36. The method of claim 35 in which in step d the predetermined cut-off is about 1.1.
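
Steps d and e of claim 35 amount to counting the data sets on which a descriptor's density ratio exceeds the cut-off and dividing by the number of data sets. A minimal sketch, assuming the density ratios have already been obtained from the Patterson plots. The function name `ranking_ratio` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* density[i] is one descriptor's Patterson-plot density ratio on data set i. */
double ranking_ratio(const double *density, int nsets, double cutoff)
{
    int i, passed = 0;

    assert(density != NULL && nsets > 0);
    for (i = 0; i < nsets; i++)
        if (density[i] > cutoff) passed++;  /* step d: strictly greater than the cut-off */
    return (double)passed / nsets;          /* step e: numerator over number of data sets */
}
```

Descriptors can then be ordered by this value, with the cut-off of about 1.1 from claim 36 passed as `cutoff`.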

37. A computer-based method of merging with a base assembly of molecules one or more
additional assemblies of molecules, similar molecules in the assemblies having previously been
identified and removed using a validated molecular structural descriptor, comprising the steps
of:

a. Using a validated molecular structural descriptor which is appropriate to whole
molecules, characterizing all the molecules in the base assembly of molecules and in
the assembly of molecules to be merged;

b. Calculating the molecular structural distance between every molecule in the base
assembly to every molecule in the assembly to be merged;

c. While there are still molecules in the assembly to be merged which have not been
tested, selecting a molecule from the assembly to be merged;

d. Determining whether the molecular structural distance between the selected molecule
and every molecule in the base assembly is within the neighborhood distance of the
molecular structural descriptor;

e. Select for inclusion in the merged assemblies only those molecules identified in step
d as having molecular structural distances greater than the neighborhood distance;

f. Repeat step c through step e until all molecules in the assembly to be merged have
been tested; and

g. Repeat step a through step f for each additional assembly to be merged.

38. The method of claim 37 in which the molecular structural descriptor appropriate to
whole molecules is the Tanimoto similarity coefficient.
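
The merge of claims 37 and 38 can be sketched as follows. This is an illustration only: it assumes each molecule is represented as a set of fingerprint features, takes the distance as one minus the Tanimoto similarity, and uses a hypothetical neighborhood distance of 0.3. The names `tanimoto_distance` and `merge_assembly` are not from the source.

```python
def tanimoto_distance(a, b):
    """1 - Tanimoto similarity of two feature sets (assumed fingerprint representation)."""
    union = len(a | b)
    if union == 0:
        return 0.0
    return 1.0 - len(a & b) / union


def merge_assembly(base, candidates, neighborhood, distance=tanimoto_distance):
    """Add to the base assembly only candidates outside the neighborhood of every base molecule."""
    merged = list(base)
    for molecule in candidates:  # steps c and f: test each candidate once
        # Steps d and e: keep the candidate only if its distance to every
        # molecule of the base assembly exceeds the neighborhood distance.
        if all(distance(molecule, ref) > neighborhood for ref in base):
            merged.append(molecule)
    return merged


# Hypothetical fingerprints: the first candidate is close to a base molecule.
base = [frozenset({1, 2, 3}), frozenset({4, 5, 6})]
candidates = [frozenset({1, 2, 3, 4}), frozenset({7, 8, 9})]
merged = merge_assembly(base, candidates, neighborhood=0.3)
```

For step g, one reading is to call `merge_assembly` again with the merged result as the new base for each additional assembly; that reading is an assumption.
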

39. A computer-based method of merging with a base assembly of molecules one or more
additional assemblies of molecules, similar molecules in one or more of the assemblies having
not previously been identified and removed using a validated molecular structural descriptor,
comprising the steps of:

a. Selecting subsets of each assembly by:

(1) Selecting a molecule within each assembly;

(2) Using a validated molecular structural descriptor appropriate to whole
molecules, calculating the descriptor distance between the selected molecule and
all molecules within the assembly;

(3) Determining the shortest distance between the selected molecule and all
molecules previously selected for the subset;

(4) Selecting for inclusion in the subset the molecule whose shortest descriptor
distance from the previously selected molecules is the largest and is greater
than the neighborhood distance of the descriptor;

(5) Repeat steps (2) through (4) until the largest shortest difference between
molecules is less than the neighborhood distance of the descriptor; and

(6) Repeat steps (1) through (5) for each assembly;

b. Using a validated molecular structural descriptor which is appropriate to whole
molecules, characterizing all the molecules in the base assembly of molecules and in
the assembly of molecules to be merged;

c. Calculating the molecular structural distance between every molecule in the base
assembly to every molecule in the assembly to be merged;

d. While there are still molecules in the assembly to be merged which have not been
tested, selecting a molecule from the assembly to be merged;

e. Determining whether the molecular structural distance between the selected molecule
and every molecule in the base assembly is within the neighborhood distance of the
molecular structural descriptor;

f. Select for inclusion in the merged assemblies only those molecules identified in step
e as having molecular structural distances greater than the neighborhood distance;

g. Repeat step d through step f until all molecules in the assembly to be merged have
been tested; and

h. Repeat step b through step g for each additional assembly to be merged.
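
The subset selection in step a of claim 39 is a "largest shortest distance" (maximin) procedure. A minimal sketch follows, assuming a caller-supplied `distance` function and using scalar stand-ins for descriptor values; the function name, the `seed_index` parameter, and the example numbers are hypothetical.

```python
def select_diverse_subset(molecules, distance, neighborhood, seed_index=0):
    """Maximin selection: repeatedly add the molecule farthest from the current subset."""
    if not molecules:
        return []
    selected = [seed_index]  # step (1): start from one molecule
    # Steps (2)-(3): shortest distance from every molecule to the subset so far.
    nearest = [distance(molecules[seed_index], m) for m in molecules]
    while True:
        # Step (4): candidate whose shortest distance to the subset is the largest.
        best = max(range(len(molecules)), key=lambda i: nearest[i])
        # Step (5): stop once that distance no longer exceeds the neighborhood distance.
        if nearest[best] <= neighborhood:
            break
        selected.append(best)
        for i, m in enumerate(molecules):
            nearest[i] = min(nearest[i], distance(molecules[best], m))
    return [molecules[i] for i in selected]


# Scalar stand-ins for descriptor values; absolute difference as the distance.
subset = select_diverse_subset(
    [0.0, 0.1, 1.0, 1.05, 2.0],
    distance=lambda a, b: abs(a - b),
    neighborhood=0.5,
)
```
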
40. The use of a subset of molecules, which could be made in a combinatorial synthesis
of specified reactants and core, to specify the compounds to be synthesized and tested in
biological screening assays, said subset being selected by the following computer-based
method:

a. Characterizing all the reactant molecules with a validated molecular structural
descriptor appropriate to reactant molecules;

b. Hierarchically clustering the characterized reactant molecules until the intercluster
distance corresponds to the neighborhood distance of the validated molecular
structural descriptor or to a value close to the neighborhood distance which creates
a logical clustering break;

c. Selecting a reactant molecule from each cluster;

d. Combinatorially assembling the selected reactant molecules and core molecule into
products which would be created in the chemical synthesis;

e. Selecting a product molecule for inclusion in the subset;

f. Using a validated molecular structural descriptor appropriate to whole molecules,
calculating the descriptor distance between all selected product molecules and all
other product molecules;

g. Determining the shortest distance between each product molecule and all product
molecules previously selected;

h. Selecting for inclusion in the subset the product molecule whose shortest descriptor
distance from the previously selected molecules is the largest and is greater than the
neighborhood distance of the descriptor;

i. Repeat steps f through h until the largest shortest difference between molecules is less
than the neighborhood distance of the descriptor; and

j. Outputting a list of the selected product molecules and/or the reactant molecules from
which the selected product molecules can be formed.

41. The method of claim 40 in which the validated molecular structural descriptor
appropriate to reactant molecules is topomeric CoMFA fields.

42. The method of claim 41 in which topomeric hydrogen bond fields are used in
conjunction with the topomeric CoMFA fields descriptor.

43. The method of claim 41 in which the validated molecular structural descriptor
appropriate to whole molecules is the Tanimoto 2D coefficient.
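
Steps b through d of claim 40 (cluster the reactants, take one per cluster, assemble products) can be sketched as below. Several choices here are assumptions rather than anything the claim specifies: complete linkage is used for the intercluster distance, the first member of each cluster is taken as its representative, and scalar values stand in for topomeric CoMFA field descriptors. The function names are hypothetical.

```python
from itertools import product


def cluster_reactants(values, distance, threshold):
    """Agglomerative clustering that stops when the closest clusters exceed the threshold."""
    clusters = [[v] for v in values]

    def linkage(c1, c2):
        # Complete linkage (an assumption): largest pairwise distance between clusters.
        return max(distance(a, b) for a in c1 for b in c2)

    while len(clusters) > 1:
        pairs = [
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        ]
        d, i, j = min(pairs)
        if d > threshold:  # intercluster distance has reached the neighborhood distance
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def enumerate_products(core, reactant_sets, distance, threshold):
    """One representative per cluster at each site, then all combinations with the core."""
    representatives = [
        [cluster[0] for cluster in cluster_reactants(site, distance, threshold)]
        for site in reactant_sets
    ]
    return [(core,) + combo for combo in product(*representatives)]


products = enumerate_products(
    "core",
    [[0.0, 0.2, 5.0], [1.0, 1.1]],
    distance=lambda a, b: abs(a - b),
    threshold=1.0,
)
```

The resulting product list would then be passed to the maximin selection of steps e through i.
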

44. The molecules selected, from those which could be made in a combinatorial synthesis
of specified reactants and core, by the following computer-based method:

a. Characterizing all the reactant molecules with a validated molecular structural
descriptor appropriate to reactant molecules;

b. Hierarchically clustering the characterized reactant molecules until the intercluster
distance corresponds to the neighborhood distance of the validated molecular
structural descriptor or to a value close to the neighborhood distance which creates
a logical clustering break;

c. Selecting a reactant molecule from each cluster;

d. Combinatorially assembling the selected reactant molecules and core molecule into
products which would be created in the chemical synthesis;

e. Selecting a product molecule for inclusion in the subset;

f. Using a validated molecular structural descriptor appropriate to whole molecules,
calculating the descriptor distance between all selected product molecules and all
other product molecules;

g. Determining the shortest distance between each product molecule and all product
molecules previously selected;

h. Selecting for inclusion in the subset the product molecule whose shortest descriptor
distance from the previously selected molecules is the largest and is greater than the
neighborhood distance of the descriptor;

i. Repeat steps f through h until the largest shortest difference between molecules is less
than the neighborhood distance of the descriptor; and

j. Outputting a list of the selected product molecules and/or the reactant molecules from
which the selected product molecules can be formed.

45. The method of claim 44 in which the validated molecular structural descriptor
appropriate to reactant molecules is topomeric CoMFA fields.

46. The method of claim 45 in which topomeric hydrogen bond fields are used in
conjunction with the topomeric CoMFA fields descriptor.

47. The method of claim 45 in which the validated molecular structural descriptor
appropriate to whole molecules is the Tanimoto 2D coefficient.

48. A computer-based method of determining the neighborhood distance characteristic of
a validated molecular structural descriptor using multiple literature data sets containing a
variety of chemical structures and associated activities, comprising the following steps:

a. Applying the molecular structural descriptor to all compounds represented in each
data set to derive descriptor values;

b. Constructing a Patterson plot for each molecular structural descriptor for each data
set using the descriptor values for the compounds in each data set and their associated
activities;

c. Determining the appropriate Patterson plot line for each data set;

d. Using for each data set a point on the Y axis of the corresponding Patterson plot the
end point of an activity difference for which a neighborhood distance is desired,
determining the X axis values of the molecular structural descriptor corresponding to
the projection from the Patterson plot line of the end points of the activity difference;

e. Determining the average range of values for the neighborhood distance from the plots
for each of the data sets.
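
Steps d and e of claim 48 amount to reading an X value off each Patterson plot line and summarizing across data sets. The sketch below assumes, purely for illustration, that each Patterson plot line passes through the origin and is described by a single positive slope (activity difference per unit of descriptor distance); the function names and numbers are hypothetical.

```python
def neighborhood_distance(slope, activity_difference):
    """X-axis value where a line through the origin reaches the given activity difference."""
    if slope <= 0:
        raise ValueError("slope must be positive")
    return activity_difference / slope


def summarize_neighborhood(slopes, activity_difference):
    """Mean and (min, max) range of the per-data-set neighborhood distances (step e)."""
    distances = [neighborhood_distance(s, activity_difference) for s in slopes]
    mean = sum(distances) / len(distances)
    return mean, (min(distances), max(distances))


# Hypothetical slopes for two data sets and a desired activity difference of 2.0.
mean_distance, distance_range = summarize_neighborhood([0.5, 0.25], 2.0)
```
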
49. A method of determining the molecules within any set which are most likely to have
the same activity as a lead molecule previously identified in an assay comprising the following
steps:

a. Characterizing the lead molecule and all other compounds to be examined using a
validated molecular structural descriptor appropriate to whole molecules;

b. Determining the molecular structural descriptor distances between the lead molecule
and all the other molecules; and

c. Identifying the molecules whose distances from the lead molecule fall within the
neighborhood distance of the lead.

50. The method of claim 49 further comprising the additional steps of:

d. Determining the molecular structural descriptor distances between the set of
molecules previously identified and all the other molecules excluding the lead and the
sets;

e. Identifying the molecules whose distances from molecules in the previously selected
set fall within the neighborhood distance; and

f. Repeating steps d through e as many times as desired.
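
The lead follow-up of claims 49 and 50 expands outward from the lead one neighborhood at a time. A minimal sketch, assuming a caller-supplied `distance` function and scalar stand-ins for descriptor values (the function name and the `rounds` parameter are hypothetical):

```python
def expand_neighbors(lead, others, distance, neighborhood, rounds=1):
    """Return successive shells of molecules, each within the neighborhood of the previous shell."""
    remaining = list(others)
    frontier = [lead]
    shells = []
    for _ in range(rounds):  # step f: repeat as many times as desired
        # Steps b-c (first pass) or d-e (later passes): molecules within the
        # neighborhood distance of any molecule identified in the previous pass.
        shell = [
            m for m in remaining
            if any(distance(m, f) <= neighborhood for f in frontier)
        ]
        if not shell:
            break
        shells.append(shell)
        # Exclude the lead and previously identified sets from later passes.
        remaining = [m for m in remaining if m not in shell]
        frontier = shell
    return shells


shells = expand_neighbors(
    0.0, [0.3, 0.7, 1.1, 3.0],
    distance=lambda a, b: abs(a - b),
    neighborhood=0.5,
    rounds=3,
)
```
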

51. A method of determining the useful boundaries of exploration within any set of
molecular structures for molecules possessing the same activity as a lead molecule previously
identified in an assay comprising the following steps:

a. Characterizing the lead molecule and all other compounds to be examined using a
validated molecular structural descriptor appropriate to whole molecules;

b. Determining the molecular structural descriptor distances between the lead molecule
and all the other molecules; and

c. Identifying the molecules whose distances from the lead molecule fall within the
neighborhood distance of the lead;

d. Synthesizing and testing in an assay the molecules identified in step c and if no
activity is detected, stop;

e. If activity is detected, calculating molecular structural descriptor distances, from each
molecule identified in the previous step as showing activity, to all other compounds
(excluding the lead compound and each previously identified active compound);

f. Identifying all molecules within the neighborhood diameter of the previously
identified active molecules;

g. Synthesizing and testing in an assay the molecules identified in the previous step, and
if no activity is detected, stop; and

h. Repeating steps e through g until no further compounds show activity in the assay.

52. A computer-based method of characterizing the three dimensional structure of
reactants, which can assume many conformations, comprising the steps of:

a. Topomerically aligning the reactants; and

b. Determining the CoMFA steric fields for each topomerically aligned reactant.

53. The method of claim 52 further comprising the addition of topomeric hydrogen
bonding fields to the CoMFA steric fields.

54. A computer-based method of applying a molecular structural descriptor to a set of
reactants comprising the following steps:

a. Topomerically aligning the reactants;

b. Determining the CoMFA steric fields for each topomerically aligned reactant; and

c. Calculating the field differences between all pairs of reactants.

55. The method of claim 54 further comprising after step b the additional step of adding
topomeric hydrogen bonding fields to the CoMFA fields.

56. The method of claim 54 further comprising after step c the additional step of
hierarchically clustering the reactants until the intercluster distance is about 80 - 100 CoMFA
field units.
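
Step c of claim 54 (field differences between all pairs of reactants) can be illustrated as below. The topomeric alignment and the CoMFA field calculation are not shown; each reactant is assumed to arrive as an equal-length list of field values sampled on a common lattice, and the root-sum-of-squares difference used here is an assumption about the metric. All names and values are hypothetical.

```python
import math
from itertools import combinations


def field_difference(field_a, field_b):
    """Root-sum-of-squares difference between two field vectors (assumed metric)."""
    if len(field_a) != len(field_b):
        raise ValueError("fields must be sampled on the same lattice")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(field_a, field_b)))


def pairwise_field_differences(fields):
    """Field difference for every unordered pair of reactants, keyed by name pair."""
    names = sorted(fields)
    return {
        (p, q): field_difference(fields[p], fields[q])
        for p, q in combinations(names, 2)
    }


# Hypothetical three-point fields for three reactants.
differences = pairwise_field_differences({
    "R1": [0.0, 0.0, 0.0],
    "R2": [3.0, 4.0, 0.0],
    "R3": [0.0, 0.0, 5.0],
})
```

The resulting pairwise differences are the kind of input a hierarchical clustering step, such as that of claim 56, would consume.
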

57. In a digital computer in which representations of specified reactant molecules and a
core molecule have been stored, a computer-based method for selecting, for all possible
product molecules which could be created in a combinatorial synthesis from the reactant
molecules and common core molecule, a subset of product molecules, comprising the following
steps:

a. Characterizing all the reactant molecules with a validated molecular structural
descriptor appropriate to reactant molecules;

b. Hierarchically clustering the characterized reactant molecules until the intercluster
distance corresponds to the neighborhood distance of the validated molecular
structural descriptor or to a value close to the neighborhood distance which creates
a logical clustering break;

c. Selecting a reactant molecule from each cluster;

d. Combinatorially assembling the selected reactant molecules and core molecule into
products which would be created in the chemical synthesis;

e. Selecting a product molecule for inclusion in the subset;

f. Using a validated molecular structural descriptor appropriate to whole molecules,
calculating the descriptor distance between all selected product molecules and all
other product molecules;

g. Determining the shortest distance between each product molecule and all product
molecules previously selected;

h. Selecting for inclusion in the subset the product molecule whose shortest descriptor
distance from the previously selected molecules is the largest and is greater than the
neighborhood distance of the descriptor;

i. Repeat steps f through h until the largest shortest difference between molecules is less
than the neighborhood distance of the descriptor; and

j. Outputting a list of the selected product molecules and/or the reactant molecules from
which the selected product molecules can be formed.

58. The method of claim 57 in which the validated molecular structural descriptor
appropriate to reactant molecules is topomeric CoMFA fields.

59. The method of claim 58 in which topomeric hydrogen bond fields are used in
conjunction with the topomeric CoMFA fields descriptor.

60. The method of claim 57 in which the validated molecular structural descriptor
appropriate to whole molecules is the Tanimoto 2D coefficient.

61. A computer-based method for generating a virtual library of possible combinatorially
derived product molecules which can be searched for product molecules having desired
properties without the necessity of generating the product structures during the search,
comprising the following steps:

a. Creating one or more files identifying one or more combinatorial reactions for one or
more core structures;

b. Creating separate structural variation files (associated with the reaction identifying files)
in which are listed together the structural variations representative of those reactants
which will react at each variation site of each combinatorial reaction;

c. Associating with each structural variation, data, characterizing each structural variation
including:

(1) Characterization data, taking into account when necessary the structures of the
cores with which the structural variations would be combined in the listed
combinatorial syntheses, which has not been derived from the application of
validated molecular structural descriptors; and

(2) Characterizing data, taking into account when necessary the structures of the cores
with which the structural variations would be combined in the listed combinatorial
syntheses, which has been derived from applying validated molecular structural
descriptors to the structural variations.

62. A virtual library of possible combinatorially derived product molecules which can be
searched for product molecules having desired properties without the necessity of generating
the product structures during the search, generated by the following process:

a. Creating one or more files identifying one or more combinatorial reactions for one or
more core structures;

b. Creating separate structural variation files (associated with the reaction identifying files)
in which are listed together the structural variations representative of those reactants
which will react at each variation site of each combinatorial reaction;

c. Associating with each structural variation, data, characterizing each structural variation
including:

(1) Characterization data, taking into account when necessary the structures of the
cores with which the structural variations would be combined in the listed
combinatorial syntheses, which has not been derived from the application of
validated molecular structural descriptors; and

(2) Characterizing data, taking into account when necessary the structures of the cores
with which the structural variations would be combined in the listed combinatorial
syntheses, which has been derived from applying validated molecular structural
descriptors to the structural variations.
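
The organization recited in claims 61 and 62 (reaction records, per-site structural variation lists, and two kinds of data attached to each variation) can be pictured with a small in-memory model. This is a sketch only: the claims speak of files, whereas the classes, field names, and example values below are hypothetical. It also shows how the size of the product space can be obtained from the per-site lists without building any product structure.

```python
from dataclasses import dataclass, field
from math import prod


@dataclass
class StructuralVariation:
    name: str
    # Step c(1): data not derived from validated descriptors (hypothetical example: a weight).
    properties: dict = field(default_factory=dict)
    # Step c(2): data derived from validated molecular structural descriptors.
    descriptors: dict = field(default_factory=dict)


@dataclass
class Reaction:
    name: str   # step a: identifies the combinatorial reaction
    core: str   # core structure the variations attach to
    sites: list # step b: one list of StructuralVariation per variation site

    def product_count(self):
        """Number of possible products, computed without enumerating them."""
        return prod(len(site) for site in self.sites)

    def filter_site(self, site_index, predicate):
        """Variations at one site that satisfy a search criterion."""
        return [v for v in self.sites[site_index] if predicate(v)]


reaction = Reaction(
    name="hypothetical_reaction",
    core="hypothetical_core",
    sites=[
        [StructuralVariation("A1", {"weight": 50}), StructuralVariation("A2", {"weight": 120})],
        [StructuralVariation("B1"), StructuralVariation("B2"), StructuralVariation("B3")],
    ],
)
light = reaction.filter_site(0, lambda v: v.properties.get("weight", 0) < 100)
```
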
63. The method of claim 61 further comprising a computer-based method for selecting
from the virtual library, for all possible product molecules which could be created by all
combinatorial arrangements of specified structural variations and a common core molecule, a
subset of product molecules, comprising the following additional steps:

b. identifying all possible combinatorial product molecules which could result from the
specified reactants and selected core molecules;

c. selecting from all possible combinatorial product molecules a product molecule for
inclusion in the subset;

d. using a validated molecular descriptor appropriate to whole molecules with which the
Virtual Library was generated, removing from the set of all remaining molecules those
molecules falling within a chosen neighborhood distance of the selected molecule;

e. using a validated molecular descriptor appropriate to the structural variations with which
the Virtual Library was generated, removing from the set of all remaining product
molecules those molecules formed from structural variations falling within a chosen
neighborhood distance of the structural variations of the selected molecule;

f. selecting from the set of all product molecules remaining after step e a product molecule
for inclusion in the subset;

g. repeating steps d through f until no additional product molecules remain to be selected
in step f; and

h. Outputting a list of the selected subset and/or the structural variations from which the
subset can be formed.

64. The method of claim 61 further comprising a computer-based method for selecting
from the virtual library, for all possible product molecules which could be created by all
combinatorial arrangements of specified structural variations and core molecules, a subset of
product molecules, comprising the following additional steps:

b. selecting from all possible cores a core upon which to base the subset;

c. using a validated molecular descriptor appropriate to cores, selecting from the set of all
possible cores those core molecules falling within the neighborhood distance of the
selected core molecule;

d. identifying all possible combinatorial product molecules which could result from the
specified structural variations and selected core molecules;

e. selecting from all possible combinatorial product molecules a product molecule for
inclusion in the subset;

f. using a validated molecular descriptor appropriate to whole molecules with which the
Virtual Library was generated, removing from the set of all remaining molecules those
molecules falling within a chosen neighborhood distance of the selected molecule;

g. using a validated molecular descriptor appropriate to the structural variations with which
the Virtual Library was generated, removing from the set of all remaining product
molecules those molecules formed from structural variations falling within a chosen
neighborhood distance of the structural variations of the selected molecule;

h. selecting from the set of all product molecules remaining after step g a product molecule
for inclusion in the subset;

i. repeating steps f through h until no additional product molecules remain to be selected
in step h; and

j. Outputting a list of the selected subset and/or the structural variations and cores from
which the subset can be formed.
65. The method of claim 61 further comprising a computer-based method for selecting
from the virtual library, for all possible product molecules which could be created by all
combinatorial arrangements of specified structural variations and a common core molecule, a
subset of product molecules, comprising the following additional steps:

b. identifying all possible combinatorial product molecules which could result from the
specified reactants and selected core molecules;

c. selecting from all possible combinatorial product molecules a product molecule for
inclusion in the subset;

d. using a validated molecular descriptor appropriate to whole molecules with which the
Virtual Library was generated, removing from the set of all remaining molecules those
molecules falling within the neighborhood distance of the selected molecule;

e. selecting from the set of all product molecules remaining after step d a product molecule
for inclusion in the subset;

f. repeating steps d through e until no additional product molecules remain to be selected
in step f; and

g. Outputting a list of the selected subset and/or the structural variations from which the
subset can be formed.
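
The select-and-remove loop of claim 65 can be sketched as follows. The sketch assumes the candidate products are already listed and that a whole-molecule `distance` function is supplied; picking the first remaining product at each pass is an arbitrary choice made here, since the claim does not say how the next product is chosen. The function name and example values are hypothetical.

```python
def exclusion_select(products, distance, neighborhood):
    """Pick a product, remove everything within its neighborhood, and repeat."""
    remaining = list(products)
    subset = []
    while remaining:  # step f: continue until nothing is left to select
        chosen = remaining.pop(0)  # steps c and e: select a remaining product
        subset.append(chosen)
        # Step d: drop products falling within the neighborhood distance of the selection.
        remaining = [m for m in remaining if distance(m, chosen) > neighborhood]
    return subset  # step g: the selected subset


# Scalar stand-ins for whole-molecule descriptor values.
chosen_subset = exclusion_select(
    [0.0, 0.2, 0.9, 1.0, 2.5],
    distance=lambda a, b: abs(a - b),
    neighborhood=0.5,
)
```

The variants in claims 63 and 66 add or substitute a second removal pass based on the structural variations of the selected product; that pass is not shown here.
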

66. The method of claim 61 further comprising a computer-based method for selecting
from the virtual library, for all possible product molecules which could be created by all
combinatorial arrangements of specified structural variations and a common core molecule, a
subset of product molecules, comprising the following additional steps:

b. identifying all possible combinatorial product molecules which could result from the
specified reactants and selected core molecules;

c. selecting from all possible combinatorial product molecules a product molecule for
inclusion in the subset;

d. using a validated molecular descriptor appropriate to the structural variations with which
the Virtual Library was generated, removing from the set of all remaining product
molecules those molecules formed from structural variations falling within a chosen
neighborhood distance of the structural variations of the selected molecule;

e. selecting from the set of all product molecules remaining after step d a product molecule
for inclusion in the subset;

f. repeating steps d through e until no additional product molecules remain to be selected
in step e; and

g. Outputting a list of the selected subset and/or the structural variations from which the
subset can be formed.

67. A screening library designed by a computer-based method which selects the screening
library molecules from those molecules which could be created by all combinatorial
arrangements of specified structural variations and a common core molecule comprising the
following steps:

a. generating a virtual library by:

(1). creating one or more files identifying one or more combinatorial reactions for one
or more core structures;

(2). creating separate structural variation files (associated with the reaction identifying
files) in which are listed together the structural variations representative of those
reactants which will react at each variation site of each combinatorial reaction;

(3). associating with each structural variation, data, characterizing each structural
variation including:

(a). characterization data, taking into account when necessary the structures of the
cores with which the structural variations would be combined in the listed
combinatorial syntheses, which has not been derived from the application of
validated molecular structural descriptors; and

(b). characterizing data, taking into account when necessary the structures of the
cores with which the structural variations would be combined in the listed
combinatorial syntheses, which has been derived from applying validated
molecular structural descriptors to the structural variations;

b. identifying in the virtual library all possible combinatorial product molecules which
could result from the specified reactants and selected core molecules;

c. selecting from all possible combinatorial product molecules a product molecule for
inclusion in the subset;

d. using a validated molecular descriptor appropriate to whole molecules with which the
Virtual Library was generated, removing from the set of all remaining molecules those
molecules falling within a chosen neighborhood distance of the selected molecule;

e. using a validated molecular descriptor appropriate to the structural variations with which
the Virtual Library was generated, removing from the set of all remaining product
molecules those molecules formed from structural variations falling within a chosen
neighborhood distance of the structural variations of the selected molecule;

f. selecting from the set of all product molecules remaining after step e a product molecule
for inclusion in the subset;

g. repeating steps d through f until no additional product molecules remain to be selected
in step f; and

h. Outputting a list of the selected subset and/or the structural variations from which the
subset can be formed.

68. A screening library designed by a computer-based method which selects the screening
library molecules from those molecules which could be created by all combinatorial
arrangements of specified structural variations and core molecules comprising the following
steps:

a. generating a virtual library by:

(1). creating one or more files identifying one or more combinatorial reactions for one
or more core structures;

(2). creating separate structural variation files (associated with the reaction identifying
files) in which are listed together the structural variations representative of those
reactants which will react at each variation site of each combinatorial reaction;

(3). associating with each structural variation, data, characterizing each structural
variation including:

(a). characterization data, taking into account when necessary the structures of the
cores with which the structural variations would be combined in the listed
combinatorial syntheses, which has not been derived from the application of
validated molecular structural descriptors; and

(b). characterizing data, taking into account when necessary the structures of the
cores with which the structural variations would be combined in the listed
combinatorial syntheses, which has been derived from applying validated
molecular structural descriptors to the structural variations;

b. selecting from all possible cores a core upon which to base the subset;

c. using a validated molecular descriptor appropriate to cores, selecting from the set of all
possible cores those core molecules falling within the neighborhood distance of the
selected core molecule;

d. identifying all possible combinatorial product molecules which could result from the
specified reactants and selected core molecules;

e. selecting from all possible combinatorial product molecules a product molecule for
inclusion in the subset;

f. using a validated molecular descriptor appropriate to whole molecules with which the
Virtual Library was generated, removing from the set of all remaining molecules those
molecules falling within a chosen neighborhood distance of the selected molecule;

g. using a validated molecular descriptor appropriate to the structural variations with which
the Virtual Library was generated, removing from the set of all remaining product
molecules those molecules formed from structural variations falling within a chosen
neighborhood distance of the structural variations of the selected molecule;

h. selecting from the set of all product molecules remaining after step g a product molecule
for inclusion in the subset;

i. repeating steps f through h until no additional product molecules remain to be selected
in step h; and

j. Outputting a list of the selected subset and/or the structural variations and cores from
which the subset can be formed.
69. The use of a subset of molecules, which could be made in a combinatorial synthesis
of specified reactants and common core, to specify the compounds to be synthesized and tested
in appropriate assays, said subset being selected by the following computer-based method:

a. generating a virtual library by:

(1). creating one or more files identifying one or more combinatorial reactions for one
or more core structures;

(2). creating separate structural variation files (associated with the reaction identifying
files) in which are listed together the structural variations representative of those
reactants which will react at each variation site of each combinatorial reaction;

(3). associating with each structural variation, data, characterizing each structural
variation including:

(a). characterization data, taking into account when necessary the structures of the
cores with which the structural variations would be combined in the listed
combinatorial syntheses, which has not been derived from the application of
validated molecular structural descriptors; and

(b). characterizing data, taking into account when necessary the structures of the
cores with which the structural variations would be combined in the listed
combinatorial syntheses, which has been derived from applying validated
molecular structural descriptors to the structural variations;

b. identifying in the virtual library all possible combinatorial product molecules which
could result from the specified reactants and selected core molecules;

c. selecting from all possible combinatorial product molecules a product molecule for
inclusion in the subset;

d. using a validated molecular descriptor appropriate to whole molecules with which the
Virtual Library was generated, removing from the set of all remaining molecules those
molecules falling within a chosen neighborhood distance of the selected molecule;

e. using a validated molecular descriptor appropriate to the structural variations with which
the Virtual Library was generated, removing from the set of all remaining product
molecules those molecules formed from structural variations falling within a chosen
neighborhood distance of the structural variations of the selected molecule;

f. selecting from the set of all product molecules remaining after step e a product molecule
for inclusion in the subset;

g. repeating steps d through f until no additional product molecules remain to be selected
in step f; and

h. Outputting a list of the selected subset and/or the reactants from which the subset can
be formed.

70. The molecules selected, from those which could be made in a combinatorial synthesis
of specified reactants and common core, by the following computer-based method:
a. generating a virtual library by:

(1). creating one or more files identifying one or more combinatorial reactions for one
or more core structures;

(2). creating separate structural variation files (associated with the reaction identifying
files) in which are listed together the structural variations representative of those
reactants which will react at each variation site of each combinatorial reaction;

(3). associating with each structural variation, data, characterizing each structural
variation including:

(a). characterization data, taking into account when necessary the structures of the
cores with which the structural variations would be combined in the listed
combinatorial syntheses, which has not been derived from the application of
validated molecular structural descriptors; and

(b). characterizing data, taking into account when necessary the structures of the
cores with which the structural variations would be combined in the listed
combinatorial syntheses, which has been derived from applying validated
molecular structural descriptors to the structural variations;
b. identifying in the virtual library all possible combinatorial product molecules which
could result from the specified reactants and core molecule;

c. selecting from all possible combinatorial product molecules a product molecule for
inclusion in the subset;

d. using a validated molecular descriptor appropriate to whole molecules with which the
Virtual Library was generated, removing from the set of all remaining molecules those
molecules falling within a chosen neighborhood distance of the selected molecule;

e. using a validated molecular descriptor appropriate to the structural variations with which
the Virtual Library was generated, removing from the set of all remaining product
molecules those molecules formed from structural variations falling within a chosen
neighborhood distance of the structural variations of the selected molecule;

f. selecting from the set of all product molecules remaining after step e a product molecule
for inclusion in the subset;

g. repeating steps d through f until no additional product molecules remain to be selected
in step f; and

h. Outputting a list of the selected subset and/or the reactants from which the subset can
be formed.

71. The molecules selected, from those which could be made in a combinatorial synthesis
of specified reactants and cores, by the following computer-based method:
a. generating a virtual library by:

(1). creating one or more files identifying one or more combinatorial reactions for one
or more core structures;

(2). creating separate structural variation files (associated with the reaction identifying
files) in which are listed together the structural variations representative of those
reactants which will react at each variation site of each combinatorial reaction;

(3). associating with each structural variation, data, characterizing each structural
variation including:

(a). characterization data, taking into account when necessary the structures of the
cores with which the structural variations would be combined in the listed
combinatorial syntheses, which has not been derived from the application of
validated molecular structural descriptors; and

(b). characterizing data, taking into account when necessary the structures of the
cores with which the structural variations would be combined in the listed
combinatorial syntheses, which has been derived from applying validated
molecular structural descriptors to the structural variations;

b. selecting from all possible cores a core upon which to base the subset;

c. using a validated molecular descriptor appropriate to cores, selecting from the set of all
possible cores those core molecules falling within the neighborhood distance of the
selected core molecule;

d. identifying all possible combinatorial product molecules which could result from the
specified reactants and selected core molecules;

e. selecting from all possible combinatorial product molecules a product molecule for
inclusion in the subset;

f. using a validated molecular descriptor appropriate to whole molecules with which the
Virtual Library was generated, removing from the set of all remaining molecules those
molecules falling within a chosen neighborhood distance of the selected molecule;

g. using a validated molecular descriptor appropriate to the structural variations with which
the Virtual Library was generated, removing from the set of all remaining product
molecules those molecules formed from structural variations falling within a chosen
neighborhood distance of the structural variations of the selected molecule;

h. selecting from the set of all product molecules remaining after step g a product molecule
for inclusion in the subset;

i. repeating steps f through h until no additional product molecules remain to be selected
in step h; and

j. Outputting a list of the selected subset and/or the reactants from which the subset can
be formed. 
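
Steps b through d above (choosing a core, gathering the cores in its neighborhood, and identifying every combinatorial product) can be sketched as follows. The names `cores_in_neighborhood` and `enumerate_products`, the dictionary of scalar core descriptor values, and the tuple representation of a product are hypothetical; the sketch assumes one list of structural variations per variation site.

```python
from itertools import product
from typing import Callable, List, Sequence, Tuple


def cores_in_neighborhood(
    seed_core: str,
    cores: Sequence[str],
    core_distance: Callable[[str, str], float],
    radius: float,
) -> List[str]:
    """Return the cores lying within `radius` of the seed core (seed included)."""
    return [c for c in cores if core_distance(seed_core, c) <= radius]


def enumerate_products(
    cores: Sequence[str],
    variation_sites: Sequence[Sequence[str]],
) -> List[Tuple[str, ...]]:
    """List every (core, variation at site 1, variation at site 2, ...) combination."""
    return [(core, *combo) for core in cores for combo in product(*variation_sites)]


# Toy data: a made-up scalar descriptor value for each core.
core_values = {"coreA": 0.0, "coreB": 0.4, "coreC": 2.0}
core_dist = lambda a, b: abs(core_values[a] - core_values[b])
near_cores = cores_in_neighborhood("coreA", list(core_values), core_dist, 0.5)
toy_library = enumerate_products(near_cores, [["r1", "r2"], ["x1"]])
```

The number of products grows as the product of the site list lengths times the number of selected cores, which is why the later exclusion steps work on identifiers and descriptor values rather than on synthesized compounds.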

72. The method of claim 1 further comprising a computer-based method for selecting
from the virtual library, for all possible product molecules which could be created by all
combinatorial arrangements of specified structural variations and a common core molecule, a
subset of product molecules, comprising the following additional steps:

b. identifying all possible combinatorial product molecules which could result from the
specified reactants and selected core molecules;

c. selecting from all possible combinatorial product molecules a product molecule for
inclusion in the subset;

d. using a combination validated molecular descriptor characterizing both whole molecule
and structural variation features with which the Virtual Library was generated, removing
from the set of all remaining molecules those molecules falling within a chosen
neighborhood distance of the selected molecule;

e. selecting from the set of all product molecules remaining after step d a product molecule
for inclusion in the subset;

f. repeating steps d through e until no additional product molecules remain to be selected
in step e; and

h. Outputting a list of the selected subset and/or the structural variations from which the 
subset can be formed. 

73. The method of claim 61 further comprising a computer-based method for selecting
from the virtual library, for all possible product molecules which could be created by all
combinatorial arrangements of specified structural variations and core molecules, a subset of
product molecules, comprising the following additional steps:

b. selecting from all possible cores a core upon which to base the subset;

c. using a validated molecular descriptor appropriate to cores, selecting from the set of all
possible cores those core molecules falling within the neighborhood distance of the
selected core molecule;

d. identifying all possible combinatorial product molecules which could result from the
specified structural variations and selected core molecules;

e. selecting from all possible combinatorial product molecules a product molecule for
inclusion in the subset;

f. using a combination validated molecular descriptor characterizing both whole molecule
and structural variation features with which the Virtual Library was generated, removing
from the set of all remaining molecules those molecules falling within a chosen
neighborhood distance of the selected molecule;

g. selecting from the set of all product molecules remaining after step e a product molecule
for inclusion in the subset;

f. repeating steps e through g until no additional product molecules remain to be selected
in step g; and

h. Outputting a list of the selected subset and/or the structural variations and cores from
which the subset can be formed.
74. The molecules selected, from those which could be made in a combinatorial synthesis
of specified reactants and common core, by the following computer-based method:
a. generating a virtual library by:

(1). creating one or more files identifying one or more combinatorial reactions for one
or more core structures;

(2). creating separate structural variation files (associated with the reaction identifying
files) in which are listed together the structural variations representative of those
reactants which will react at each variation site of each combinatorial reaction;

(3). associating with each structural variation, data, characterizing each structural
variation including:

(a). characterization data, taking into account when necessary the structures of the
cores with which the structural variations would be combined in the listed
combinatorial syntheses, which has not been derived from the application of
validated molecular structural descriptors; and

(b). characterizing data, taking into account when necessary the structures of the
cores with which the structural variations would be combined in the listed
combinatorial syntheses, which has been derived from applying validated
molecular structural descriptors to the structural variations;
b. identifying in the virtual library all possible combinatorial product molecules which
could result from the specified reactants and core molecule;

c. selecting from all possible combinatorial product molecules a product molecule for
inclusion in the subset;

d. using a combination validated molecular descriptor characterizing both whole molecule
and structural variation features with which the Virtual Library was generated, removing
from the set of all remaining molecules those molecules falling within a chosen
neighborhood distance of the selected molecule;

e. selecting from the set of all product molecules remaining after step d a product molecule
for inclusion in the subset;

f. repeating steps d through e until no additional product molecules remain to be selected
in step e; and

h. Outputting a list of the selected subset and/or the reactants from which the subset can
be formed.

75. The molecules selected, from those which could be made in a combinatorial synthesis
of specified reactants and cores, by the following computer-based method:

a. generating a virtual library by:

(1). creating one or more files identifying one or more combinatorial reactions for one
or more core structures;

(2). creating separate structural variation files (associated with the reaction identifying
files) in which are listed together the structural variations representative of those
reactants which will react at each variation site of each combinatorial reaction;

(3). associating with each structural variation, data, characterizing each structural
variation including:

(a). characterization data, taking into account when necessary the structures of the
cores with which the structural variations would be combined in the listed
combinatorial syntheses, which has not been derived from the application of
validated molecular structural descriptors; and

(b). characterizing data, taking into account when necessary the structures of the
cores with which the structural variations would be combined in the listed
combinatorial syntheses, which has been derived from applying validated
molecular structural descriptors to the structural variations;

b. selecting from all possible cores a core upon which to base the subset;

c. using a validated molecular descriptor appropriate to cores, selecting from the set of all
possible cores those core molecules falling within the neighborhood distance of the
selected core molecule;

d. identifying all possible combinatorial product molecules which could result from the
specified reactants and selected core molecules;

e. selecting from all possible combinatorial product molecules a product molecule for
inclusion in the subset;

f. using a combination validated molecular descriptor characterizing both whole molecule
and structural variation features with which the Virtual Library was generated, removing
from the set of all remaining molecules those molecules falling within a chosen
neighborhood distance of the selected molecule;

g. selecting from the set of all product molecules remaining after step f a product molecule
for inclusion in the subset;

f. repeating steps f through g until no additional product molecules remain to be selected
in step g; and

h. Outputting a list of the selected subset and/or the reactants and cores from which the
subset can be formed.

76. The method of claim 61 further comprising a method of determining within the virtual
library, the molecules which could be created by all combinatorial arrangements of specified
structural variations and a common core molecule, which are most likely to have the same type
of activity as a molecule of interest comprising the following steps:

a. identifying in the virtual library all possible combinatorial product molecules which
could result from the specified reactants and selected core molecules;
b. characterizing the molecule of interest with a validated molecular structural descriptor
appropriate to whole molecules with which the virtual library was generated;
d. using the same validated molecular descriptor appropriate to whole molecules, selecting
the set of all possible molecules whose descriptor values fall within a chosen
neighborhood distance of the selected molecule; and
g. Outputting a list of the selected subset and/or the structural variations from which the
subset can be formed. 
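
The neighborhood search recited above (characterize the molecule of interest, then keep every library molecule whose descriptor values fall within the chosen distance) can be sketched as below. This is an illustration under stated assumptions: the helper name `molecules_near_interest` is hypothetical, descriptors are modeled as numeric vectors, and Euclidean distance is used only as a stand-in for whatever difference measure the validated descriptor defines.

```python
import math
from typing import Callable, Dict, List, Sequence


def molecules_near_interest(
    interest_descriptor: Sequence[float],
    library_descriptors: Dict[str, Sequence[float]],
    radius: float,
    distance: Callable[[Sequence[float], Sequence[float]], float] = math.dist,
) -> List[str]:
    """Return library molecule names whose descriptor lies within `radius`.

    `library_descriptors` maps a molecule name to its descriptor vector,
    computed with the same descriptor as `interest_descriptor`.
    """
    return [
        name
        for name, values in library_descriptors.items()
        if distance(interest_descriptor, values) <= radius
    ]


# Toy data: made-up two-component descriptor vectors.
toy_descriptors = {
    "prod1": (0.0, 0.0),
    "prod2": (3.0, 4.0),
    "prod3": (0.5, 0.5),
}
toy_hits = molecules_near_interest((0.0, 0.0), toy_descriptors, radius=1.0)
```

The same function covers the structural-variation and combination-descriptor variants by swapping the descriptor vectors and, if needed, the `distance` argument.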

77. The method of claim 61 further comprising a method of determining within the virtual
library, the molecules which could be created by all combinatorial arrangements of specified
structural variations and a common core molecule, which are most likely to have the same type
of activity as a molecule of interest comprising the following steps:

a. identifying in the virtual library all possible combinatorial product molecules which
could result from the specified reactants and selected core molecules;

b. characterizing the molecule of interest with a validated molecular structural descriptor
appropriate to structural variations with which the virtual library was generated;

d. using the same validated molecular descriptor appropriate to structural variations,
selecting the set of all possible molecules whose descriptor values fall within a chosen
neighborhood distance of the selected molecule; and
g. Outputting a list of the selected subset and/or the structural variations from which the
subset can be formed.

78. The method of claim 61 further comprising a method of determining within the virtual
library, the molecules which could be created by all combinatorial arrangements of specified
structural variations and a common core molecule, which are most likely to have the same type
of activity as a molecule of interest comprising the following steps:
a. identifying in the virtual library all possible combinatorial product molecules which
could result from the specified reactants and selected core molecules;

b. characterizing the molecule of interest with both a validated molecular structural
descriptor appropriate to structural variations with which the virtual library was
generated and with a validated molecular structural descriptor appropriate to structural
variations with which the virtual library was generated;

d. using the same validated molecular descriptor appropriate to whole molecules, selecting
the set of all possible molecules whose descriptor values fall within a chosen
neighborhood distance of the selected molecule, and using the same validated molecular
descriptor appropriate to structural variations, selecting the set of all possible molecules
whose descriptor values fall within a chosen neighborhood distance of the selected
molecule; and

e. Outputting a list of the selected subset and/or the structural variations from which the
subset can be formed.

79. The method of claim 61 further comprising a method of determining within the virtual
library, the molecules which could be created by all combinatorial arrangements of specified
structural variations and a common core molecule, which are most likely to have the same type
of activity as a molecule of interest comprising the following steps:

a. identifying in the virtual library all possible combinatorial product molecules which
could result from the specified reactants and selected core molecules;

b. characterizing the molecule of interest with a combination validated molecular
descriptor, characterizing both whole molecule and structural variation features, with
which the Virtual Library was generated;

d. using the same validated molecular descriptor, selecting the set of all possible molecules
whose descriptor values fall within a chosen neighborhood distance of the selected
molecule; and

g. Outputting a list of the selected subset and/or the structural variations from which the
subset can be formed.

80. The molecules, which are most likely to have the same type of activity as a molecule
of interest, selected, from those which could be made in a combinatorial synthesis from
specified reactants and a common core molecule, by the following computer-based method:
a. generating a virtual library by:

(1). creating one or more files identifying one or more combinatorial reactions for one
or more core structures;

(2). creating separate structural variation files (associated with the reaction identifying
files) in which are listed together the structural variations representative of those
reactants which will react at each variation site of each combinatorial reaction;

(3). associating with each structural variation, data, characterizing each structural
variation including:

(a). characterization data, taking into account when necessary the structures of the
cores with which the structural variations would be combined in the listed
combinatorial syntheses, which has not been derived from the application of
validated molecular structural descriptors; and

(b). characterizing data, taking into account when necessary the structures of the
cores with which the structural variations would be combined in the listed
combinatorial syntheses, which has been derived from applying validated
molecular structural descriptors to the structural variations;
b. identifying in the virtual library all possible combinatorial product molecules which
could result from the specified reactants and selected core molecules;

c. characterizing the molecule of interest with both a validated molecular structural
descriptor appropriate to structural variations with which the virtual library was
generated and with a validated molecular structural descriptor appropriate to structural
variations with which the virtual library was generated;

d. using the same validated molecular descriptor appropriate to whole molecules, selecting
the set of all possible molecules whose descriptor values fall within a chosen
neighborhood distance of the selected molecule, and using the same validated molecular
descriptor appropriate to structural variations, selecting the set of all possible molecules
whose descriptor values fall within a chosen neighborhood distance of the selected
molecule; and

e. Outputting a list of the selected subset and/or the reactants from which the subset can be
formed.

81. The molecules, which are most likely to have the same type of activity as a molecule
of interest, selected, from those which could be made in a combinatorial synthesis from
specified reactants and a common core molecule, by the following computer-based method:
a. generating a virtual library by:

(1). creating one or more files identifying one or more combinatorial reactions for one
or more core structures;

(2). creating separate structural variation files (associated with the reaction identifying
files) in which are listed together the structural variations representative of those
reactants which will react at each variation site of each combinatorial reaction;

(3). associating with each structural variation, data, characterizing each structural
variation including:

(a). characterization data, taking into account when necessary the structures of the
cores with which the structural variations would be combined in the listed
combinatorial syntheses, which has not been derived from the application of
validated molecular structural descriptors; and

(b). characterizing data, taking into account when necessary the structures of the
cores with which the structural variations would be combined in the listed
combinatorial syntheses, which has been derived from applying validated
molecular structural descriptors to the structural variations;

b. identifying in the virtual library all possible combinatorial product molecules which
could result from the specified reactants and selected core molecules;

c. characterizing the molecule of interest with a combination validated molecular
descriptor, characterizing both whole molecule and structural variation features, with
which the Virtual Library was generated;

d. using the same validated molecular descriptor, selecting the set of all possible molecules
whose descriptor values fall within a chosen neighborhood distance of the selected
molecule; and

e. Outputting a list of the selected subset and/or the reactant from which the subset of
molecules can be formed.

82. The use of a subset of molecules, which are most likely to have the same type of 
activity as a molecule of interest and selected from those which could be made in a 
combinatorial synthesis from specified reactants and a common core molecule, to specify the 
compounds to be synthesized and tested in appropriate assays, said subset being selected by 
the following computer-based method: 

a. generating a virtual library by: 

(1). creating one or more files identifying one or more combinatorial reactions for one
or more core structures;

(2). creating separate structural variation files (associated with the reaction identifying
files) in which are listed together the structural variations representative of those
reactants which will react at each variation site of each combinatorial reaction;

(3). associating with each structural variation, data, characterizing each structural
variation including:

(a). characterization data, taking into account when necessary the structures of the
cores with which the structural variations would be combined in the listed
combinatorial syntheses, which has not been derived from the application of
validated molecular structural descriptors; and

(b). characterizing data, taking into account when necessary the structures of the
cores with which the structural variations would be combined in the listed
combinatorial syntheses, which has been derived from applying validated
molecular structural descriptors to the structural variations;
b. identifying in the virtual library all possible combinatorial product molecules which
could result from the specified reactants and selected core molecules;

c. selecting from all possible combinatorial product molecules a product molecule for
inclusion in the subset;

d. characterizing the molecule of interest with both a validated molecular structural
descriptor appropriate to whole molecules with which the virtual library was generated
and with a validated molecular structural descriptor appropriate to structural variations
with which the virtual library was generated;

e. using the same validated molecular descriptor appropriate to whole molecules, selecting
the set of all possible molecules whose descriptor values fall within a chosen
neighborhood distance of the selected molecule, and using the same validated molecular
descriptor appropriate to structural variations, selecting the set of all possible molecules
whose descriptor values fall within a chosen neighborhood distance of the selected
molecule; and

f. Outputting a list of the selected subset and/or the reactants from which the subset can be
formed.

83. The use of a subset of molecules, which are most likely to have the same type of
activity as a molecule of interest and selected from those which could be made in a
combinatorial synthesis from specified reactants and a common core molecule, to specify the
compounds to be synthesized and tested in appropriate assays, said subset being selected by
the following computer-based method:

a. generating a virtual library by:

(1). creating one or more files identifying one or more combinatorial reactions for one
or more core structures;

(2). creating separate structural variation files (associated with the reaction identifying
files) in which are listed together the structural variations representative of those
reactants which will react at each variation site of each combinatorial reaction;

(3). associating with each structural variation, data, characterizing each structural
variation including:

(a). characterization data, taking into account when necessary the structures of the
cores with which the structural variations would be combined in the listed
combinatorial syntheses, which has not been derived from the application of
validated molecular structural descriptors; and

(b). characterizing data, taking into account when necessary the structures of the
cores with which the structural variations would be combined in the listed
combinatorial syntheses, which has been derived from applying validated
molecular structural descriptors to the structural variations;

b. identifying in the virtual library all possible combinatorial product molecules which
could result from the specified reactants and selected core molecules;

c. selecting from all possible combinatorial product molecules a product molecule for
inclusion in the subset;

d. characterizing the molecule of interest with a combination validated molecular
descriptor, characterizing both whole molecule and structural variation features, with
which the Virtual Library was generated;

e. using the same validated molecular descriptor, selecting the set of all possible molecules
whose descriptor values fall within a chosen neighborhood distance of the selected
molecule; and

f. Outputting a list of the selected subset and/or the reactant from which the subset of
molecules can be formed.

84. The method of claim 61 further comprising a method of determining within the virtual
library, the molecules which could be created by all combinatorial arrangements of specified
structural variations and core molecules, which are most likely to have the same type of
activity as a molecule of interest, comprising the following steps:

a. selecting from all possible cores a core upon which to base the subset;




b. using a validated molecular descriptor appropriate to cores, selecting from the set of all
possible cores those core molecules falling within the neighborhood distance of the
selected core molecule;

c. identifying all possible combinatorial product molecules which could result from the
specified reactants and selected core molecules;

d. selecting and characterizing the molecule of interest with a validated molecular structural
descriptor appropriate to whole molecules with which the virtual library was generated;

e. using the same validated molecular descriptor appropriate to whole molecules, selecting
the set of all possible molecules whose descriptor values fall within a chosen
neighborhood distance of the selected molecule; and

f. Outputting a list of the selected subset and/or the structural variations from which the
subset can be formed.

85. The method of claim 61 further comprising a method of determining within the virtual
library, the molecules which could be created by all combinatorial arrangements of structural
variations and core molecules, which are most likely to have the same type of activity as a
molecule of interest, which is not known to be derived from a combinatorial reaction,
comprising the following steps:

a. fragmenting the molecule of interest as described in a fragmentation table;

b. selecting a fragmentation pattern;

c. aligning the fragments according to topomeric alignment rules;

d. generating CoMFA fields for each aligned fragment;

e. identifying which reaction types within the virtual library correspond to the reaction type
resulting from the fragmentation;

f. identifying whether the fragmentation pattern generated a core, and, if so, implementing
the following steps:

(1) characterizing the core with CoMFA fields; and

(2) identifying, by comparing the field values, whether the core resembles any cores
used in the creation of the virtual library;

g. selecting structural variations which were used in generating the virtual library with
cores which matched the core resulting from the fragmentation;

h. comparing the CoMFA fields of the topomerically aligned fragments with the fields of
the identified structural variations by taking the root sum of squares field differences;

i. selecting those structural variations for which the root sum of squares field difference
falls within a chosen neighborhood value;

j. outputting a list of the selected subset and/or the structural variations from which the
subset can be formed;

k. repeating steps b through j for all possible fragments.
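
The field comparison in steps h and i above can be sketched as follows. The sketch assumes each CoMFA field has already been flattened into an equal-length list of numbers sampled at the same lattice points; the function names and the toy field values are hypothetical and are not taken from the disclosure.

```python
import math
from typing import Dict, List, Sequence


def root_sum_of_squares_difference(
    field_a: Sequence[float], field_b: Sequence[float]
) -> float:
    """Square root of the summed squared point-by-point field differences."""
    if len(field_a) != len(field_b):
        raise ValueError("fields must be sampled at the same lattice points")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(field_a, field_b)))


def variations_within_neighborhood(
    fragment_field: Sequence[float],
    variation_fields: Dict[str, Sequence[float]],
    neighborhood: float,
) -> List[str]:
    """Names of structural variations whose field difference is within the neighborhood."""
    return [
        name
        for name, field in variation_fields.items()
        if root_sum_of_squares_difference(fragment_field, field) <= neighborhood
    ]


# Toy data: three lattice points per field, values invented for illustration.
toy_fragment = [0.0, 0.0, 0.0]
toy_variations = {"varA": [3.0, 4.0, 0.0], "varB": [0.1, 0.0, 0.0]}
toy_selected = variations_within_neighborhood(toy_fragment, toy_variations, 1.0)
```

Raising an error on unequal lengths guards against comparing fields computed on different lattices, where a point-by-point difference would be meaningless.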

86. The molecules, which are most likely to have the same type of activity as a molecule
of interest which is not known to be derived from a combinatorial reaction, selected from those
product molecules which could be created by all combinatorial arrangements of structural
variations and core molecules, by the following computer-based method:

a. generating a virtual library by:

(1). creating one or more files identifying one or more combinatorial reactions for one
or more core structures;

(2). creating separate structural variation files (associated with the reaction identifying
files) in which are listed together the structural variations representative of those
reactants which will react at each variation site of each combinatorial reaction;

(3). associating with each structural variation, data, characterizing each structural
variation including:

(a). characterization data, taking into account when necessary the structures of the
cores with which the structural variations would be combined in the listed
combinatorial syntheses, which has not been derived from the application of
validated molecular structural descriptors; and

(b). characterizing data, taking into account when necessary the structures of the
cores with which the structural variations would be combined in the listed
combinatorial syntheses, which has been derived from applying validated
molecular structural descriptors to the structural variations;

b. fragmenting the molecule of interest as described in a fragmentation table; 

c. selecting a fragmentation pattern; 

d. aligning the fragments according to topomeric alignment rules; 

e. generating CoMFA fields for each aligned fragment;

f. identifying which reaction types within the virtual library correspond to the reaction type
resulting from the fragmentation; 

g. identifying whether the fragmentation pattern generated a core, and, if so, implementing 
the following steps: 

(1) characterizing the core with CoMFA fields; and 




(2) identifying, by comparing the field values, whether the core resembles any cores 
used in the creation of the virtual library; 
h. selecting structural variations which were used in generating the virtual library with
cores which matched the core resulting from the fragmentation;
i. comparing the CoMFA fields of the topomerically aligned fragments with the fields of
the identified structural variations by taking the root sum of squares field differences;
j. selecting those structural variations for which the root sum of squares field difference
falls within a chosen neighborhood value;
k. outputting a list of the selected subset and/or the structural variations from which the
subset can be formed;

l. repeating steps c through k for all possible fragments.
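
As an illustration of steps i and j only, the following minimal Python sketch computes a root sum of squares difference between two field vectors and keeps the structural variations that fall within a chosen neighborhood value. The function names, the flat-list representation of a field, the inclusive comparison, and the numeric values are assumptions made for the example; they are not taken from this disclosure.

```python
import math


def field_distance(a, b):
    """Root sum of squared differences between two equal-length field vectors."""
    if len(a) != len(b):
        raise ValueError("field vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_neighbors(query_field, library_fields, radius):
    """Names of library entries whose distance to the query is within radius.

    Treating the neighborhood boundary as inclusive is an assumption.
    """
    return [
        name
        for name, field in library_fields.items()
        if field_distance(query_field, field) <= radius
    ]


# Hypothetical field values for one aligned fragment and three
# structural variations from a virtual library.
query = [0.0, 1.0, 2.0]
library = {
    "R1": [0.0, 1.0, 2.0],
    "R2": [3.0, 5.0, 2.0],
    "R3": [0.0, 1.0, 3.0],
}
selected = select_neighbors(query, library, radius=1.5)
print(selected)
```

With these illustrative values, R1 and R3 lie within the neighborhood and R2 does not.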

87. The method of claims 63 or 65 or 69 or 71 or 72 or 73 or 74 or 75 or 80 or 86 or
88 in which the following additional step is performed immediately after the step of using a
validated molecular descriptor appropriate to whole molecules:
t. repeating the previous step for another validated molecular descriptor appropriate to
whole molecules with which the Virtual Library was generated until no additional
whole molecule descriptor remains to be used.

88. The method of claims 63 or 65 or 70 or 71 or 72 or 73 or 74 or 75 or 81 or 86 in
which the following additional step is performed immediately after the step of using a validated
molecular descriptor appropriate to structural variations:

u. repeating the previous step for another validated molecular descriptor appropriate to
structural variations with which the Virtual Library was generated until no additional
structural variation descriptor remains to be used.

89. The method of claim 63 in which the additional step t is performed immediately after
the step of using a validated molecular descriptor appropriate to whole molecules and further
in which step u is performed immediately after the step of using a validated molecular
descriptor appropriate to structural variations:

t. repeating the previous step for another validated molecular descriptor appropriate to
whole molecules with which the Virtual Library was generated until no additional
whole molecule descriptor remains to be used; and

u. repeating the previous step for another validated molecular descriptor appropriate to
structural variations with which the Virtual Library was generated until no additional
structural variation descriptor remains to be used.
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90. The method of claims 61 or 63 or 65 or 70 or 71 or 72 or 73 or 74 or 86 in which 
the validated molecular structural descriptor appropriate to structural variations is topomeric
CoMFA fields. 

91. The method of claims 61 or 63 or 65 or 70 or 71 or 72 or 73 or 74 or 86 in which
topomeric hydrogen bond fields are used in conjunction with the topomeric CoMFA fields
descriptor.

92. The method of claims 63 or 65 or 69 or 71 or 72 or 73 or 74 or 75 or 80 or 86 or
88 in which the validated molecular structural descriptor appropriate to whole molecules is the 
Tanimoto 2D coefficient. 
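
For readers unfamiliar with the Tanimoto coefficient, the following minimal Python sketch shows the standard calculation on two sets of fingerprint features: the number of shared features divided by the number of features present in either set. Representing a 2D fingerprint as a set of integer feature identifiers, the function name, and the convention of returning 1.0 for two empty fingerprints are assumptions made for the example.

```python
def tanimoto(features_a, features_b):
    """Tanimoto coefficient of two feature sets: |A & B| / |A | B|."""
    a, b = set(features_a), set(features_b)
    union = a | b
    if not union:
        # Assumed convention: two empty fingerprints count as identical.
        return 1.0
    return len(a & b) / len(union)


# Hypothetical fingerprints given as sets of feature identifiers.
mol_a = {1, 2, 3, 4}
mol_b = {3, 4, 5, 6}
similarity = tanimoto(mol_a, mol_b)  # 2 shared features out of 6 in total
print(similarity)
```

The coefficient ranges from 0 (no shared features) to 1 (identical feature sets).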

93. The method of claim 63 in which after step g product molecules with the following
characteristics are removed from further use in the method: 

a. toxic reactant molecules; 

b. reactant molecules containing metals, improper forms of tautomers, and interfering
chemical groups;

c. reactant molecules with too low a bioavailability; 

d. reactant molecules not likely to cross membranes; and 

e. reactant molecules containing biologically non-relevant groups. 

94. The method of claim 63 in which after step g product molecules with the following
characteristics are removed from further use in the method: 

a. product molecules having MW > 750; and 

b. product molecules not having a CLOGP between -2 and 7.5.
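
The two removal criteria of claim 94 can be sketched as a simple filter. In the minimal Python example below, the Product record, the function name, the inclusive treatment of the CLOGP bounds, and the sample molecules with their property values are all hypothetical; in practice the molecular weight and CLOGP values would be computed by separate software.

```python
from dataclasses import dataclass


@dataclass
class Product:
    name: str
    mw: float      # molecular weight
    clogp: float   # calculated log P, assumed to be precomputed elsewhere


def passes_filters(product, max_mw=750.0, clogp_range=(-2.0, 7.5)):
    """True if the product survives both removal criteria.

    Products with MW > max_mw are removed, as are products whose CLOGP
    is outside clogp_range (bounds treated as inclusive, an assumption).
    """
    low, high = clogp_range
    return product.mw <= max_mw and low <= product.clogp <= high


# Hypothetical product molecules with made-up property values.
products = [
    Product("P1", mw=420.5, clogp=3.1),   # kept
    Product("P2", mw=812.0, clogp=4.0),   # removed: MW > 750
    Product("P3", mw=300.2, clogp=8.3),   # removed: CLOGP above 7.5
    Product("P4", mw=150.0, clogp=-2.6),  # removed: CLOGP below -2
]
retained = [p.name for p in products if passes_filters(p)]
print(retained)
```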

95. The methods of selecting screening libraries as disclosed in this invention. 

96. The systems for selecting screening libraries as disclosed in this invention. 

97. The screening libraries selected by the methods or systems disclosed in this invention. 

98. The metric validation method as disclosed in this invention. 

99. The method of merging libraries as disclosed in this invention. 

100. The method of lead explosion as disclosed in this invention. 

101. The methods of molecular alignment as disclosed in this invention. 

102. The new molecular structural descriptors as disclosed in this invention. 

103. The methods of generating a virtual library as disclosed in this invention. 

104. The methods of searching a virtual library as disclosed in this invention.

105. The virtual library as disclosed in this invention. 
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